Introduction


Throughout this course, we shall be interested in the analog-to-digital
conversion of signals $f(t)$, $t \in \mathbb{R}$. We shall always assume $f \in L_2(\mathbb{R})$ and
usually assume additional properties of $f$ in order to get meaningful results.
In particular, we want to study two mappings: the encoding of $f$ into bit
streams and the decoding of the bit streams into approximations or
estimates of $f$,

Equation:

$$E : f \to \text{bit stream} \quad \text{(Encoder)}$$
$$D : \text{bit stream} \to \bar{f} \quad \text{(Decoder)}$$

where $\bar{f}$ is the approximation of $f$ defined by $\bar{f} := D(E(f))$. In general,
$\bar{f} \neq f$, so we shall need some way of quantifying how well $\bar{f}$
approximates $f$. Normally, the distortion between $f$ and $\bar{f}$ is measured by some
norm $\| f - \bar{f} \|$. Typical choices include:


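Equation:

$$\text{the } L_2 \text{ norm: } \quad \| f \|_{L_2} := \left( \int_{\mathbb{R}} |f(t)|^2 \, dt \right)^{1/2}$$
$$\text{the } L_\infty \text{ norm: } \quad \| f \|_{L_\infty} := \sup_{t} |f(t)|$$
$$\text{the } L_p \text{ norm: } \quad \| f \|_{L_p} := \left( \int_{\mathbb{R}} |f(t)|^p \, dt \right)^{1/p}$$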


The Shannon-Whittaker Sampling Theorem


The classical theory behind the encoding of analog signals into bit streams and
the decoding of bit streams back into signals rests on a famous sampling theorem,
which is typically referred to as the Shannon-Whittaker Sampling Theorem.
In this course, this sampling theory will serve as a benchmark against which we
shall compare the new theory of compressed sensing.


To introduce the Shannon-Whittaker theory, we first define the class of
bandlimited signals. A bandlimited signal is a signal whose Fourier
transform has bounded support. We shall denote this class by $B_A$ and
define it in the following way:

Equation:

$$B_A := \{ f \in L_2(\mathbb{R}) : \hat{f}(\omega) = 0, \ |\omega| \ge A\pi \}.$$


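Here, the Fourier transform of $f$ is defined by
Equation:

$$\hat{f}(\omega) := \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} f(t) e^{-i\omega t} \, dt.$$

This formula holds for any $f \in L_1$, and extends easily to $f \in L_2$ via limits.
The inversion of the Fourier transform is given by
Equation:

$$f(t) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \hat{f}(\omega) e^{i\omega t} \, d\omega.$$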


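Shannon-Whittaker Sampling Theorem


If $f \in B_A$, then $f$ can be uniquely determined by the uniformly spaced
samples $f\left(\frac{n}{A}\right)$, $n \in \mathbb{Z}$, and in fact, is given by
Equation:

$$f(t) = \sum_{n \in \mathbb{Z}} f\left(\frac{n}{A}\right) \text{sinc}(\pi(At - n)),$$

where $\text{sinc}(t) = \frac{\sin t}{t}$.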


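It is enough to consider $A = 1$, since all other cases can be reduced to this
through a simple change of variables. Because $f \in B_{A=1}$, the Fourier
inversion formula takes the form

Equation:

$$f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\pi}^{\pi} \hat{f}(\omega) e^{i\omega t} \, d\omega.$$

Define $F(\omega)$ as the $2\pi$ periodization of $\hat{f}$,
Equation:

$$F(\omega) := \sum_{n \in \mathbb{Z}} \hat{f}(\omega - 2n\pi).$$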


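Because $F(\omega)$ is periodic, it admits a Fourier series representation

Equation:

$$F(\omega) = \sum_{n \in \mathbb{Z}} c_n e^{-in\omega},$$

where the Fourier coefficients $c_n$ are given by
Equation:

$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} F(\omega) e^{in\omega} \, d\omega = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{f}(\omega) e^{in\omega} \, d\omega.$$

By comparing ([link]) with ([link]), we conclude that
Equation:

$$c_n = \frac{1}{\sqrt{2\pi}} f(n).$$

Therefore by plugging ([link]) back into ([link]), we have that
Equation:

$$F(\omega) = \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} f(n) e^{-in\omega}.$$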


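Now, because
Equation:

$$\hat{f}(\omega) = F(\omega) \chi_{[-\pi,\pi]}(\omega) = \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} f(n) e^{-in\omega} \chi_{[-\pi,\pi]}(\omega),$$

and because of the facts that
Equation:

$$\mathcal{F}^{-1}\left[ \chi_{[-\pi,\pi]} \right](t) = \sqrt{2\pi} \, \text{sinc}(\pi t) \quad \text{and} \quad \mathcal{F}\left[ g(\cdot - n) \right](\omega) = e^{-in\omega} \hat{g}(\omega),$$

we conclude
Equation:

$$f(t) = \sum_{n \in \mathbb{Z}} f(n) \, \text{sinc}(\pi(t - n)).$$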
Comments:


1. (Good news) The set $\{ \text{sinc}(\pi(t - n)) \}_{n \in \mathbb{Z}}$ is an orthogonal system
and therefore has the property that the $L_2$ norm of the function and its
Fourier coefficients are related by

Equation:

$$\| f \|_{L_2}^2 = 2\pi \sum_{n \in \mathbb{Z}} |c_n|^2 = \sum_{n \in \mathbb{Z}} |f(n)|^2.$$

2. (Bad news) The representation of $f$ in terms of sinc functions is not a
stable representation, i.e.
Equation:

$$\sum_{n \in \mathbb{Z}} |\text{sinc}(\pi(t - n))| \approx \sum_{n \in \mathbb{Z}} \frac{1}{|t - n| + 1} \quad \text{diverges}.$$
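
The two comments above can be checked numerically. The following is a minimal sketch, assuming $A = 1$ and a hypothetical test signal $f(t) = \text{sinc}(\pi(t - 0.3))$ (chosen only because it is bandlimited to $[-\pi, \pi]$ and has a closed form); the function names are illustrative and not part of the notes. It evaluates a truncated sinc series and the partial sums of $\sum_n |\text{sinc}(\pi(t - n))|$, which keep growing as more terms are added.

```python
import math

def sinc(t: float) -> float:
    """sinc(t) = sin(t)/t, with the removable singularity at t = 0 filled in."""
    return 1.0 if t == 0.0 else math.sin(t) / t

def f(t: float) -> float:
    """Hypothetical test signal: a shifted sinc, bandlimited to [-pi, pi]."""
    return sinc(math.pi * (t - 0.3))

def shannon_reconstruct(signal, t: float, n_max: int) -> float:
    """Truncated series: sum over |n| <= n_max of signal(n) * sinc(pi (t - n))."""
    return sum(signal(n) * sinc(math.pi * (t - n))
               for n in range(-n_max, n_max + 1))

def abs_sinc_sum(t: float, n_max: int) -> float:
    """Partial sum of |sinc(pi (t - n))| over |n| <= n_max."""
    return sum(abs(sinc(math.pi * (t - n)))
               for n in range(-n_max, n_max + 1))

if __name__ == "__main__":
    for t in (0.3, 0.5, 2.25):
        approx = shannon_reconstruct(f, t, n_max=2000)
        print(f"t={t}: f(t)={f(t):.6f}  truncated series={approx:.6f}")
    # The absolute sums grow (roughly logarithmically) with the truncation level.
    for n_max in (10, 100, 1000):
        print(n_max, abs_sinc_sum(0.5, n_max))
```

The slow convergence of the truncated series and the unbounded growth of the absolute sums are two views of the same slow $1/|t|$ decay of the sinc function.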


Optimal Encoding 


We shall now consider the encoding of signals on $[-T, T]$ where $T > 0$ is
fixed. Ultimately we shall be interested in encoding classes of bandlimited
signals like the class $B_A$. However, we begin the story by considering the
more general setting of encoding the elements of any given compact subset
$K$ of a normed linear space $X$. One can determine the best encoding of $K$
by what is known as the Kolmogorov entropy of $K$ in $X$.


To begin, let us consider an encoder-decoder pair $(E, D)$. $E$ maps $K$ to a
finite stream of bits. $D$ maps a stream of bits to a signal in $X$. This is
illustrated in [link]. Note that many functions can be mapped onto the same
bitstream.

Illustration of encoding and decoding.


Define the distortion $d$ for this encoder-decoder by
Equation:

$$d(K, E, D, X) := \sup_{f \in K} \| f - D(E f) \|_X.$$

Let $n(K, E) := \sup_{f \in K} \# E f$, where $\# E f$ is the number of bits in the
bitstream $E f$. Thus $n$ is the maximum length of the bitstreams for the
various $f \in K$. There are two ways we can define optimal encoding:


1. Prescribe $\epsilon$, the maximum distortion that we are willing to tolerate. For
this $\epsilon$, find the smallest bit budget
$n_\epsilon(K, X) := \inf_{(E, D)} \{ n(K, E) : d(K, E, D, X) \le \epsilon \}$. This is the
smallest bit budget under which we could encode all elements of $K$ to
distortion $\epsilon$.

2. Prescribe $N$: find the smallest distortion $d(K, E, D, X)$ over all $E, D$
with $n(K, E) \le N$. This is the best encoding performance possible
with a prescribed bit budget.


There is a simple mathematical solution to these two encoding problems 
based on the notion of Kolmogorov Entropy. 


Kolmogorov Entropy 


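Coverings of $K$ by balls of radius $\epsilon$.


Given $\epsilon > 0$ and the compact set $K$, consider all coverings of $K$ by balls
of radius $\epsilon$, as shown in [link]. In other words,
Equation:

$$K \subset \bigcup_{i=1}^{N} b(f_i, \epsilon),$$

where $b(f_i, \epsilon)$ denotes the ball of radius $\epsilon$ centered at $f_i \in X$.

Let $N_\epsilon := \inf \{ N : \text{over all such covers} \}$. $N_\epsilon(K)$ is called the
covering number of $K$. Since it depends on $X$ and $K$, we write it as
$N_\epsilon = N_\epsilon(K, X)$.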

Definition 

Kolmogorov entropy 


The Kolmogorov entropy, denoted by $H_\epsilon(K, X)$, of the compact set $K$ in
$X$ is defined as the logarithm (to base 2, since we count bits) of the covering number:
Equation:

$$H_\epsilon(K, X) := \log_2 N_\epsilon(K, X).$$


The Kolmogorov entropy solves our problem of optimal encoding in the 
sense of the following theorem. 


For any compact set $K \subset X$, we have $n_\epsilon(K, X) = \lceil H_\epsilon(K, X) \rceil$, where
$\lceil \cdot \rceil$ is the ceiling function.


Sketch: We can define an encoder-decoder as follows. To encode: Say
$f \in K$. Just specify which ball it is covered by. Because the number of
balls is $N_\epsilon(K, X)$, we need at most $\lceil \log_2 N_\epsilon(K, X) \rceil$ bits to specify any
such ball.


To decode: Just take the center of the ball specified by the bitstream.


It is now easy to see that this encoder-decoder pair is optimal in either of
the senses given above.
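
As a toy illustration of the sketch above, the following assumes the simplest possible setting, which is not taken from the notes: $K = [0, 1]$ inside $X = \mathbb{R}$ with the absolute value as norm. A minimal cover then uses $\lceil 1/(2\epsilon) \rceil$ intervals of radius $\epsilon$, the encoder sends the index of the covering ball, and the decoder returns its center. All function names are illustrative.

```python
import math

def covering_centers(eps: float) -> list[float]:
    """Centers of a minimal cover of K = [0, 1] by intervals of radius eps."""
    n_balls = math.ceil(1.0 / (2.0 * eps))
    return [(2 * i + 1) * eps for i in range(n_balls)]

def encode(x: float, eps: float) -> str:
    """Send the index of a ball containing x, written in binary."""
    n_balls = len(covering_centers(eps))
    # Use at least one bit so that the bitstream is never empty.
    n_bits = max(1, math.ceil(math.log2(n_balls)))
    index = min(int(x / (2.0 * eps)), n_balls - 1)
    return format(index, f"0{n_bits}b")

def decode(bits: str, eps: float) -> float:
    """Return the center of the ball named by the bitstream."""
    return covering_centers(eps)[int(bits, 2)]

if __name__ == "__main__":
    eps = 2 ** -4
    for x in (0.0, 0.1, 0.5, 1.0):
        bits = encode(x, eps)
        print(x, bits, decode(bits, eps), abs(x - decode(bits, eps)) <= eps)
```

In this toy case the covering number is easy to write down; for the function classes of interest in these notes, computing it is the hard part, which is why the encoder is described as not practical.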


The above encoder is not practical. However, the Kolmogorov entropy tells 
us the best performance we can expect from any encoder-decoder pair. 
Kolmogorov entropy is defined in the deterministic setting. It is the 
analogue of the Shannon entropy which is defined in a stochastic setting. 


Optimal Encoding of Bandlimited Signals 


We now turn back to the encoding of signals. We are interested in encoding
the set
Equation:

$$B_A(M) = \{ f \in B_A : |f(t)| \le M, \ t \in \mathbb{R} \}$$

where $M$ is arbitrary but fixed. We shall restrict our discussion to the case
where distortion is measured in $L_\infty[-T, T]$ where $T > 0$ is arbitrary but
fixed. Then, $B_A(M)$ is a compact subset of $L_\infty[-T, T]$.

Sample points $\frac{n}{\lambda A}$ are chosen in the interval $[-T(1 + \delta), T(1 + \delta)]$.


We shall sketch how one can construct an asymptotically optimal
encoder/decoder for $B_A(M)$. The details for this construction can be found in
[link].


We know $\hat{f}(\omega) = 0$ for $|\omega| \ge A\pi$, and $|f(t)| \le M$. How can we encode $f$ in
practice? We begin by choosing $\lambda = \lambda(T) > 1$ (see [link]), which will
represent a slight oversampling factor we shall utilize. Given a target
distortion $\epsilon > 0$, we choose $k$ so that $2^{-k-1} < \epsilon \le 2^{-k}$. Given $f$, we shall
encode $f$ by first taking samples $f\left(\frac{n}{\lambda A}\right)$ for $\frac{n}{\lambda A} \in [-T(1 + \delta), T(1 + \delta)]$
where $\delta = \delta(T) > 0$. In other words, we sample $f$ on a slightly larger interval
than $[-T, T]$. For each sample $f\left(\frac{n}{\lambda A}\right)$, we shall use the first $k + k_0(T)$
bits of its binary expansion. In other words, our encoder takes $f$, computes the
samples $f\left(\frac{n}{\lambda A}\right)$, and then assigns to each $f\left(\frac{n}{\lambda A}\right)$ the first $k + k_0(T)$ bits of this
number.


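To decode, the receiver would take the bits and construct the approximation
$\bar{f}\left(\frac{n}{\lambda A}\right)$ to $f\left(\frac{n}{\lambda A}\right)$ from the bits provided. Notice that we have the
accuracy
Equation:

$$\left| f\left(\frac{n}{\lambda A}\right) - \bar{f}\left(\frac{n}{\lambda A}\right) \right| \le M \cdot 2^{-k-k_0}.$$

We utilize the function $g_\lambda$ satisfying ([link]) to define
Equation:

$$\bar{f}(t) := \sum_{n \in N_T} \bar{f}\left(\frac{n}{\lambda A}\right) g_\lambda(\lambda A t - n),$$

where
Equation:

$$N_T := \left\{ n : -T(1 + \delta) \le \frac{n}{\lambda A} \le T(1 + \delta) \right\}.$$

We then have
Equation:

$$|f(t) - \bar{f}(t)| \le \sum_{n \in N_T} \left| f\left(\frac{n}{\lambda A}\right) - \bar{f}\left(\frac{n}{\lambda A}\right) \right| |g_\lambda(\lambda A t - n)|
+ \sum_{|n| > \lambda A T(1 + \delta)} \left| f\left(\frac{n}{\lambda A}\right) \right| |g_\lambda(\lambda A t - n)|.$$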


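The term $f\left(\frac{n}{\lambda A}\right) - \bar{f}\left(\frac{n}{\lambda A}\right)$ that appears in the first summation in ([link])
is bounded by $M \cdot 2^{-k-k_0}$. The term $f\left(\frac{n}{\lambda A}\right)$ that appears in the second
summation in the same equation is bounded by $M$. Therefore,
Equation:

$$|f(t) - \bar{f}(t)| \le \sum_{n \in N_T} M \cdot 2^{-k-k_0} |g_\lambda(\lambda A t - n)|
+ \sum_{|n| > \lambda A T(1 + \delta)} M \cdot |g_\lambda(\lambda A t - n)| =: S_1 + S_2.$$

We can estimate $S_1$ by

Equation:

$$\begin{aligned}
S_1 &= \sum_{n \in N_T} M \cdot 2^{-k-k_0} |g_\lambda(\lambda A t - n)| \\
&\le M \cdot 2^{-k-k_0} \sum_{n \in \mathbb{Z}} |g_\lambda(\lambda A t - n)| \\
&\le M \cdot C_0(\lambda) \cdot 2^{-k-k_0} \quad (\text{because } g_\lambda(\cdot) \text{ decays fast}).
\end{aligned}$$

Therefore, if we choose $k_0$ sufficiently large, then
$S_1 \le M \cdot C_0(\lambda) \cdot 2^{-k-k_0} \le \frac{\epsilon}{2}$. The second summation $S_2$ can also be
bounded by $\epsilon/2$ by using the fast decay of the function $g_\lambda$ (see ([link])).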


To make the encoder/decoder specific we need to precisely define $\delta$ and $\lambda$.
It turns out that the best choices (in terms of bit rate performance on the
class $B_A$) depend on $T$, but $\delta_T \to 0$ and $\lambda_T \to 1$ as $T \to \infty$. Recall that
Shannon sampling requires $2TA$ samples. Since our encoder/decoder uses
$k + k_0$ bits per sample, the total number of bits is $(k + k_0) \cdot 2\lambda A T(1 + \delta)$,
and so coding will require roughly $k$ bits per Shannon sample.


This encoder/decoder can be proven to be optimal in the sense of averaged
performance, as we shall now describe. The average performance of
optimal encoding is defined by


Equation:

$$\lim_{T \to \infty} \frac{n_\epsilon\left( B_A(M), L_\infty[-T, T] \right)}{2T}.$$

If we replace the optimal bit rate $n_\epsilon$ in ([link]) by the number of bits
required by our encoder/decoder, then the resulting limit will be the same as
that in ([link]).


In summary, to encode bandlimited signals on an interval $[-T, T]$, an
optimal strategy is to sample at a slightly higher rate than Nyquist and on a
slightly larger interval than $[-T, T]$. Each sample should then be quantized
by using the binary expansion of the sample. In this way, for an investment
of $k$ bits per Nyquist rate sample, we get a distortion of $2^{-k}$.


To get a feel for the number of bits required by such an encoder, let us say
$A = 10^6$ (signals bandlimited to 1 MHz). Say
$T = 24$ hours $\approx 10^5$ seconds, and $k = 10$ bits. Then,
$A \cdot k \cdot 2T \approx 10^6 \cdot 10 \cdot 2 \cdot 10^5 = 2 \times 10^{12}$ bits. This is too big!
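
The bit count above is simple arithmetic, and a small helper makes it easy to try other parameters. This is only a sketch: the function name `pcm_bit_budget` and its default arguments are hypothetical, and the formula is the total $(k + k_0) \cdot 2\lambda A T (1 + \delta)$ quoted earlier, with $k_0 = 0$, $\lambda = 1$ and $\delta = 0$ as defaults so that it reduces to $A \cdot k \cdot 2T$.

```python
import math

def pcm_bit_budget(A: float, T: float, k: int, k0: int = 0,
                   lam: float = 1.0, delta: float = 0.0) -> int:
    """Total bits (k + k0) * 2 * lam * A * T * (1 + delta).

    A     : bandwidth parameter (samples per second at the Nyquist rate)
    T     : half-length of the time interval [-T, T], in seconds
    k, k0 : bits per sample (k0 is the extra safety margin)
    lam   : oversampling factor, delta : relative enlargement of the interval
    """
    n_samples = math.ceil(2.0 * lam * A * T * (1.0 + delta))
    return (k + k0) * n_samples

if __name__ == "__main__":
    # The example from the text, with T rounded to 1e5 seconds and with T = 24 h exactly.
    print(pcm_bit_budget(A=1e6, T=1e5, k=10))
    print(pcm_bit_budget(A=1e6, T=24 * 3600, k=10))
```

Either way the budget is on the order of $10^{12}$ bits, which is the point of the example.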


The above encoding is known as Pulse Code Modulation (PCM). In
practice, people frequently use another encoder called Sigma-Delta
Modulation. Instead of oversampling just slightly, Sigma-Delta
oversamples heavily and then assigns only one (or a few) bits per sample.


Why is Sigma-Delta preferred to PCM in practice? There are two reasons 
commonly given: 


1. Getting accurate samples, quantization, etc. is not practical because of 
noise. For better accuracy, we need more expensive hardware. 

2. Noise shaping. In Sigma-Delta, the distortion is higher but the 
distortion is spread over frequencies outside of the desired range. 


In PCM, the distortion decays exponentially (like $2^{-n}$), whereas for Sigma-Delta,
the distortion decays only polynomially (like $n^{-r}$). Although the
distortion decays faster in PCM, the distortion in Sigma-Delta is spread
outside the desired frequency range.


Stable Signal Representations 


To fix the instability of the Shannon representation, we assume that the
signal is slightly more bandlimited than before,
Equation:

$$\hat{f}(\omega) = 0 \quad \text{for } |\omega| \ge \pi - \delta, \quad \delta > 0,$$

and instead of using $\chi_{[-\pi,\pi]}$, we multiply by another function $\hat{g}(\omega)$ which is
very similar in form to the characteristic function, but decays at its
boundaries in a smoother fashion (i.e. it has more derivatives). A candidate
function $\hat{g}$ is sketched in [link].

Sketch of $\hat{g}$ (horizontal axis marked at $-\pi$, $-\pi + \delta$, $0$, $\pi - \delta$, $\pi$).


Now, it is a property of the Fourier transform that increased smoothness
in one domain translates into faster decay in the other. Thus, we can fix
our instability problem by choosing $g$ so that $\hat{g}$ is smooth, $\hat{g}(\omega) = 1$ for
$|\omega| \le \pi - \delta$, and $\hat{g}(\omega) = 0$ for $|\omega| > \pi$. By choosing the smoothness of $\hat{g}$ suitably
large, we can, for any given $m \ge 1$, choose $g$ to satisfy

Equation:

$$|g(t)| \le \frac{C}{(1 + |t|)^m}$$

for some constant $C > 0$.


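Using such a $g$, we can rewrite ([link]) as
Equation:

$$\hat{f}(\omega) = F(\omega) \hat{g}(\omega) = \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} f(n) e^{-in\omega} \hat{g}(\omega).$$

Thus, we have the new representation (with the constant $1/\sqrt{2\pi}$ absorbed into the normalization of $g$)

Equation:

$$f(t) = \sum_{n \in \mathbb{Z}} f(n) \, g(t - n),$$

where we gain stability from our additional assumption that the signal is
bandlimited on $[-\pi + \delta, \pi - \delta]$.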


Does this assumption really hurt? No, not really, because if our signal is
really bandlimited to $[-\pi, \pi]$ and not $[-\pi + \delta, \pi - \delta]$, we can always take a
slightly larger bandwidth, say $[-\lambda\pi, \lambda\pi]$ where $\lambda$ is a little larger than one,
and carry out the same analysis as above. Doing so would only mean
slightly oversampling the signal (small cost).


Recall that in the end we want to convert analog signals into bit streams.
Thus far, we have the two representations
Equation:

$$f(t) = \sum_{n \in \mathbb{Z}} f(n) \, \text{sinc}(\pi(t - n)),$$
$$f(t) = \sum_{n \in \mathbb{Z}} f\left(\frac{n}{\lambda}\right) g_\lambda(\lambda t - n).$$

Shannon's Theorem tells us that if $f \in B_A$, we should sample $f$ at the
Nyquist rate $A$ (which is twice the highest frequency in the support of $\hat{f}$) and then take the binary
representation of the samples. Our more stable representation says to
slightly oversample $f$ and then convert to a binary representation. Both
representations offer perfect reconstruction, although in the more stable
representation, one is saddled with the additional task of choosing an
appropriate $\lambda$.


In practical situations, we shall be interested in approximating $f$ on an
interval $[-T, T]$ for some $T > 0$ and not for all time. Questions we still
want to answer include


1. How many bits do we need to represent $f$ in $B_{A=1}$ on some interval
$[-T, T]$ in the norm $L_\infty[-T, T]$?

2. Using this methodology, what is the optimal way of encoding?

3. How is the optimal encoding implemented?


Towards this end, we define
Equation:

$$B_A := \{ f \in L_2(\mathbb{R}) : \hat{f}(\omega) = 0, \ |\omega| \ge A\pi \}.$$


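Then for any $f \in B_A$, we can write
Equation:

$$f(t) = \sum_{n \in \mathbb{Z}} f\left(\frac{n}{A}\right) \text{sinc}(\pi(At - n)).$$

Fourier transform of $g_\lambda(\cdot)$ (horizontal axis marked at $-\lambda A\pi$, $-A\pi$, $0$, $A\pi$, $\lambda A\pi$).


In other words, samples at $0, \pm\frac{1}{A}, \pm\frac{2}{A}, \cdots$ are sufficient to reconstruct $f$.


Recall also that $\text{sinc}(t) = \frac{\sin t}{t}$ decays poorly (leading to numerical
instability). We can overcome this problem by slight over-sampling. Say we
over-sample by a factor $\lambda > 1$. Then, we can write
Equation:

$$f(t) = \sum_{n \in \mathbb{Z}} f\left(\frac{n}{\lambda A}\right) g_\lambda(\lambda A t - n).$$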


Hence we need samples at $0, \pm\frac{1}{\lambda A}, \pm\frac{2}{\lambda A}$, etc. What is the advantage?
Sampling more often than necessary buys us stability because we now have
a choice for $g_\lambda(\cdot)$. If we choose $g_\lambda(\cdot)$ so that its Fourier transform is
infinitely differentiable and looks as shown in [link], we can obtain

Equation:

$$|g_\lambda(t)| \le \frac{C_{\lambda,k}}{(1 + |t|)^k}, \quad k = 1, 2, \ldots$$

and therefore $g_\lambda(\cdot)$ decays very fast. In other words, a sample's influence is
felt only locally. Note, however, that over-sampling generates basis
functions that are redundant (linearly dependent), unlike the integer
translates of the $\text{sinc}(\cdot)$ function.


To reconstruct signals in $[-T, T]$, the sampling interval is $[-cT, cT]$.


If we restrict our reconstruction to $t$ in the interval $[-T, T]$, we will
need samples only from $[-cT, cT]$, for $c > 1$ (see [link]), because the
distant samples will have little effect on the reconstruction in $[-T, T]$.

Preliminaries 


We previously described Shannon's Theorem plus encoding: the Nyquist 
sampling rate is the minimal required sampling rate to recover the entire 
class of bandlimited signals. We have seen that this sampling rate may be 
prohibitively large for broadband signals. One way to improve upon
this situation is to pose a different model for the signals, one that is more
restrictive than the assumption that the signals are bandlimited. Fortunately,
there are several real world scenarios in which one knows much more 
information about the signals of interest. For example, they may be written 
in terms of very few fundamental building blocks (such as sine waves or 
chirps). This leads us to define new signal classes based on notions of 
sparsity and seek to determine if we can improve on sampling and encoding 
in this new setting. 


Let us define the general setting for this section. Let $X$ be a Banach space
of functions. The typical examples are $X = L_p(\mathbb{R})$, $L_p(\mathbb{R}^d)$, $L_p(-T, T)$,
$1 \le p \le \infty$. We denote the norm on $X$ by $\| \cdot \|_X$. We define a dictionary
$\mathcal{D}$ as any collection of functions $\mathcal{D} \subset X$ such that $\| g \|_X = 1$ for all
$g \in \mathcal{D}$, i.e. all the elements of the dictionary are normalized. While the
definition is very broad, in practice dictionaries usually have more
structure. Some examples include $\mathcal{D} = \mathcal{B}$, a basis for $X$, such as (i) the
Fourier basis on $[-\pi, \pi]$ or (ii) a wavelet basis;[footnote] (iii) redundant
families of waveforms of the form $\psi_{a,b,c}(t) = e^{-a(t-b)^2} e^{ict}$, i.e.
$\mathcal{D} = \{ \psi_{a,b,c} \}_{a,b,c}$; and (iv) wavelet packets.
Wavelet bases form orthonormal systems for $L_2(\mathbb{R})$.


Definition 


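We define the class of $n$-sparse signals as
$\Sigma_n := \Sigma_n(\mathcal{D}) = \{ s = \sum_{g \in \Lambda} c_g g : \Lambda \subset \mathcal{D}, \ \#\Lambda \le n \}$. We also say that $s$ has
sparsity $n$ in $\mathcal{D}$ if $s \in \Sigma_n(\mathcal{D})$, i.e. if it can be written as a linear
combination of $n$ functions from $\mathcal{D}$. We note that $\Sigma_n$ is not a linear space;
we instead have $\Sigma_n + \Sigma_n \subset \Sigma_{2n}$.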


New Signal Models 


We now wish to consider new model classes for signals. Towards this end,
let $\{ \phi_j \}_{j=1}^{\infty}$ be an orthonormal basis for $L_2(-T, T)$. Thus for $f \in L_2$ we
can write $f = \sum_{j=1}^{\infty} c_j(f) \phi_j$ where $(c_j(f)) \in \ell_2$. We will now build an
encoder and decoder and analyze its performance on compact sets $K$. For
example, we might want to encode signals in the space

$$X_p = \{ f : (c_j(f)) \in \ell_p \}, \quad 0 < p \le 2,$$
with norm

$$\| f \|_{X_p} = \| (c_j(f)) \|_{\ell_p}.$$
However, in this space the unit ball $U(X_p)$ is not compact. To get a
compact set we need more structure on the sequence $(c_j)$. Hence we define
bree Wore SS ani al ee 


and we define the norm in this space as $\| f \|_{Y^\alpha} :=$ the smallest $c$ such that
this holds. We now take

$$K = U(X_p) \cap U(Y^\alpha)$$

to get a compact set. Notice that when $\alpha > 0$ is small the requirement for
membership in $Y^\alpha$ is very mild.


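Next, suppose that we choose a target distortion level $\epsilon = 2^{-m}$. Given $f$,
let

$$\Lambda_k := \Lambda_k(f) = \{ j \in \{1, \ldots, N\} : 2^{-k-1} < |c_j(f)| \le 2^{-k} \}$$

for $0 \le k \le M$, where $M := \left\lceil \frac{2m}{2-p} \right\rceil$. We then choose $N$ as the smallest
integer so that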


Noo eae 


and thus
$\log N \le Cm$.


It follows from the requirement that $f \in Y^\alpha$ that $\Lambda_k \subset \{1, \ldots, N\}$ for
each $0 \le k \le M$.


Recall that

$$\| f \|_{X_p}^p = \sum_{j} |c_j(f)|^p \ge \sum_{j \in \Lambda_k} |c_j(f)|^p \ge \#\Lambda_k \, 2^{-(k+1)p}.$$

Since $f \in U(X_p) \cap U(Y^\alpha)$,

$$\#\Lambda_k \le \| f \|_{X_p}^p \, 2^{(k+1)p} \le 2^{(k+1)p}.$$

Hence, the total number of indices in all of the $\Lambda_k$, $0 \le k \le M$, is $O(2^{Mp})$.
To encode, for each $f$, we can send the following bits:


• Send $\log N$ bits to identify each index in $\Lambda_k$, for $0 \le k \le M$. This
will require a total of $O(\log N \cdot 2^{Mp})$ bits.

• Send one bit to identify the sign of $c_j(f)$ for each $j \in \Lambda_k$, $0 \le k \le M$.
This will require $O(2^{Mp})$ bits.

• Send $m$ bits to describe each $c_j(f)$, $j \in \Lambda_k$, for $0 \le k \le M$. This will
require $O(m 2^{Mp})$ bits.


Thus the total number of bits used in the encoding is $O(m 2^{Mp})$.


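Notice that for each $j \in \Lambda_k$, $0 \le k \le M$, we can recover each $c_j(f)$ by

$$\bar{c}_j = \pm 2^{-k} \sum_{i=1}^{m} b_i 2^{-i},$$

where the $b_i$ are the transmitted bits and the sign is given by the sign bit. It follows that
$|c_j(f) - \bar{c}_j| \le 2^{-m-k}$ for every such coefficient. Here we have used the
fact that knowing that $j \in \Lambda_k$ means that the first nonzero binary bit of
$c_j(f)$ is the $k$-th bit.


To decode we simply set

$$\bar{f} = \sum_{k=0}^{M} \sum_{j \in \Lambda_k} \bar{c}_j \phi_j.$$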


We now analyze the error we have incurred in such an encoding. The square
of the error will consist of two parts. The first corresponds to the $j \in \Lambda_k$,
$0 \le k \le M$. For each such $j$ we have $|c_j(f) - \bar{c}_j| \le 2^{-m-k}$ and so the
total square error for this is

$$\le \sum_{k=0}^{M} \#\Lambda_k \, 2^{-2m-2k} \le \sum_{k=0}^{M} 2^{(k+1)p} \, 2^{-2m-2k} \le C 2^{-2m}$$

because $p < 2$. The second part of the error corresponds to all the
coefficients which have magnitude $\le 2^{-M}$. We have that this sum does not
exceed $\sum_{|c_j| \le 2^{-M}} |c_j|^2 \le 2^{-M(2-p)} \sum_{j=1}^{\infty} |c_j|^p \le 2^{-2m}$. Thus the total
error we incur is $O(2^{-m})$.


In summary, by allocating $O(m 2^{2mp/(2-p)})$ bits we achieve distortion $C 2^{-m}$.


Equivalently, by allocating $n \log n$ bits, we achieve distortion
$C n^{-(1/p - 1/2)}$.


Remark 


This is within a logarithmic factor of the optimal encoding given by the
Kolmogorov entropy of the class $U(X_p) \cap U(Y^\alpha)$. A slightly more careful
argument can remove this logarithm.
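
The encoder described in this section can be sketched for a finite coefficient vector. The sketch below rests on several assumptions that are not fixed by the text: the coefficients satisfy $|c_j| \le 1$, the cutoff is taken to be $M = \lceil 2m/(2-p) \rceil$, each kept coefficient is stored as a record (index, bucket $k$, sign, $m$-bit integer) rather than as a packed bitstream, and the restriction of indices to $\{1, \ldots, N\}$ is ignored. The function names are illustrative.

```python
import math

def bucket_index(a: float) -> int:
    """Return k with 2^(-k-1) < a <= 2^(-k), for 0 < a <= 1."""
    mant, exp = math.frexp(a)  # a = mant * 2**exp with 0.5 <= mant < 1
    return 1 - exp if mant == 0.5 else -exp

def encode(coeffs, m: int, p: float):
    """Keep coefficients in buckets 0..M and quantize each to m bits."""
    M = math.ceil(2 * m / (2 - p))  # assumed cutoff level
    records = []
    for j, c in enumerate(coeffs):
        a = abs(c)
        if a == 0.0:
            continue
        k = bucket_index(a)
        if k > M:
            continue  # magnitude at most 2^(-M-1): dropped
        # a * 2^k lies in (1/2, 1]; keep its first m binary digits.
        q = min(int(a * 2 ** k * 2 ** m), 2 ** m - 1)
        records.append((j, k, c < 0, q))
    return records

def decode(records, m: int, length: int):
    """Rebuild the quantized coefficient vector from the records."""
    out = [0.0] * length
    for j, k, negative, q in records:
        value = q * 2.0 ** (-m - k)
        out[j] = -value if negative else value
    return out

if __name__ == "__main__":
    coeffs = [0.9, -0.4, 0.26, 0.03, -0.001, 0.0, 1.0]
    m = 6
    recs = encode(coeffs, m, p=1.0)
    approx = decode(recs, m, len(coeffs))
    for j, k, _, _ in recs:
        print(j, k, coeffs[j], approx[j], abs(coeffs[j] - approx[j]) <= 2.0 ** (-m - k))
```

Each kept coefficient in bucket $k$ is recovered to within $2^{-m-k}$, which is the per-coefficient accuracy used in the error analysis above.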


Example: 
The Wavelet Basis 


In the method above we failed to achieve the optimal performance because
of the cost involved in identifying which indices were in each $\Lambda_k$. We will
now describe a method that can do better, using the Haar basis for $L_2[0,1]$.
Thus, we first define the scaling function

$$\varphi := \chi_{[0,1]}.$$
Next, we define the mother wavelet
$$\psi := \chi_{[0,1/2)} - \chi_{[1/2,1)}.$$

We then define the remaining wavelets recursively. They are obtained by
dilations and shifts of the mother wavelet on dyadic intervals:

$$\psi_I(x) := 2^{k/2} \psi(2^k x - j),$$
where $I = [j 2^{-k}, (j+1) 2^{-k}]$ are dyadic intervals. We denote by $\mathcal{D}_+$ the
collection of all dyadic intervals contained in $[0,1]$. Then, the collection of
functions $\{ \varphi \} \cup \{ \psi_I \}_{I \in \mathcal{D}_+}$ forms an orthonormal basis for $L_2[0,1]$.
A key property of wavelets is that a tree structure can be placed on the 
coefficients due to the use of dyadic intervals in their construction. Thus, 
let 


Oe etch eae 
and 
Tesi — Ty = Ag. 


We define 7, as the smallest tree containing 7}. Given any binary tree of 
size $n$, we can encode the tree with at most $O(n)$ bits, in the process
outperforming the encoder described above.
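
The claim that a binary tree with $n$ nodes can be described with $O(n)$ bits has a short constructive illustration. The scheme below is one standard choice and is not necessarily the one intended in the notes: visit the nodes in preorder and emit two bits per node, saying whether it has a left child and whether it has a right child, for $2n$ bits in total. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def encode_tree(root: Node) -> str:
    """Preorder traversal; each node contributes two bits (has-left, has-right)."""
    bits = []
    stack = [root]
    while stack:
        node = stack.pop()
        bits.append("1" if node.left is not None else "0")
        bits.append("1" if node.right is not None else "0")
        # Push right first so that the left subtree is visited first.
        if node.right is not None:
            stack.append(node.right)
        if node.left is not None:
            stack.append(node.left)
    return "".join(bits)

def decode_tree(bits: str) -> Node:
    """Inverse of encode_tree: rebuild the tree shape from the bit string."""
    pos = 0

    def build() -> Node:
        nonlocal pos
        has_left, has_right = bits[pos] == "1", bits[pos + 1] == "1"
        pos += 2
        node = Node()
        if has_left:
            node.left = build()
        if has_right:
            node.right = build()
        return node

    return build()

if __name__ == "__main__":
    tree = Node(Node(None, Node()), Node(Node(), None))  # 5 nodes
    code = encode_tree(tree)
    print(code, len(code), decode_tree(code) == tree)
```

Compared with sending $\log N$ bits per index, the cost per node here does not depend on how deep in the tree the node sits.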


Sparse Approximation and £p Spaces 


We now look at how well $f \in X$ can be approximated by $n$ functions in the
dictionary $\mathcal{D}$.


Definition


We define the error of $n$-term approximation of $f$ by the elements of the dictionary
$\mathcal{D}$ as
(1) $\sigma_n(f)_X := \sigma_n(f, \mathcal{D})_X := \inf_{s \in \Sigma_n} \| f - s \|_X$.
We also define the class of $r$-smooth signals in $\mathcal{D}$ as
(2) $\mathcal{A}^r := \mathcal{A}^r(\mathcal{D}) := \{ f \in X : \sigma_n(f)_X \le M n^{-r} \text{ for some } M \}$
with the corresponding norm $\| f \|_{\mathcal{A}^r} = \sup_{n = 1, 2, \ldots} n^r \sigma_n(f)_X$.


In general, the larger $r$ is, the 'smoother' the function $f \in \mathcal{A}^r(\mathcal{D})_X$. Note also that
$\mathcal{A}^r \subset \mathcal{A}^{r'}$ if $r \ge r'$. Given $f$, let $r(f) = \sup \{ r : f \in \mathcal{A}^r \}$ be a measure of the
"smoothness" of $f$, i.e. a quantification of compressibility.


Let $X = \mathcal{H}$, a Hilbert space[footnote] such as $X = L_2(\mathbb{R})$, and assume $\mathcal{D} = \mathcal{B}$ is
an orthonormal basis on $X$; i.e. if $\mathcal{B} = \{ \phi_i \}_i$, then $\langle \phi_i, \phi_j \rangle = \delta_{i,j}$, where $\delta_{i,j}$ is the
Kronecker delta. This also means that each $f \in X$ has an expansion
$f = \sum_j c_j(f) \phi_j$, where $c_j(f) = \langle f, \phi_j \rangle$. We also have $\| f \|_X^2 = \sum_{j=1}^{\infty} |c_j(f)|^2$.
A Hilbert space is a complete inner product space with the norm induced by the
inner product.


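Recall the definition of $\ell_p$ spaces: let $(a_j) \in \mathbb{R}^{\mathbb{N}}$; then $(a_j) \in \ell_p$ if $\| (a_j) \|_{\ell_p} < \infty$,
with $\| (a_j) \|_{\ell_p} = \left( \sum_j |a_j|^p \right)^{1/p}$ for $p < \infty$ and $\| (a_j) \|_{\ell_\infty} = \sup_j |a_j|$ for $p = \infty$.
We also recall that for $L_p$ spaces on compact sets, $L_p \subset L_{p'}$ if $p > p'$. The opposite
is true for $\ell_p$ spaces: $\ell_p \subset \ell_{p'}$ if $p < p'$. Hence, the smaller the value of $p$ is, the
"smaller" $\ell_p$ is.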


Example:


Does there exist a sequence $(a_n)$ with $\| (a_n) \|_{\ell_1} = \sum_n |a_n| < \infty$ but with
$\| (a_n) \|_{\ell_p} = \infty$ for all $0 < p < 1$? Consider the sequence
$a_n = \frac{1}{n \log^2 n}$, $n \ge 2$. We see that $(a_n) \in \ell_1$ but $\| (a_n) \|_{\ell_p} = \infty$ for all $0 < p < 1$.

A sequence $(a_n)$ is in $\ell_p$ if the sorted magnitudes of the $a_n$ decay faster than $n^{-1/p}$.


Define $a_n^*$ as the $n$-th largest of the magnitudes $|a_j|$, and
denote by $(a_n^*)$ the decreasing rearrangement of $(a_n)$. It is easy to show that
$k (a_k^*)^p \le \sum_{n=1}^{k} (a_n^*)^p$ for all $k$; also, if $(a_n) \in \ell_p$, then $a_k^* \le \| (a_n) \|_{\ell_p} k^{-1/p}$.


Definition 


A sequence $(a_n)$ is in weak $\ell_p$, denoted $(a_n) \in w\ell_p$, if $a_k^* \le M k^{-1/p}$ for some $M$ and all $k$. We also
define the quasinorm [footnote] $\| (a_n) \|_{w\ell_p}$ as the smallest $M > 0$ such that
$a_k^* \le M k^{-1/p}$ for each $k$.

A quasinorm has the properties of a norm except that the triangle inequality is
replaced by the condition $\| x + y \| \le C_0 \left[ \| x \| + \| y \| \right]$ for some absolute
constant $C_0$.


Example:
The sequence $a_n = \frac{1}{n}$ is in weak $\ell_1$ but not in $\ell_1$.


For $p, p'$ such that $p' > p$, we have $\ell_p \subset w\ell_p \subset \ell_{p'}$.
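
For finite sequences the decreasing rearrangement and the weak $\ell_p$ quasinorm are straightforward to compute, which gives a quick way to see the difference between $\ell_1$ and weak $\ell_1$ on truncations of $a_n = 1/n$. The following is a minimal sketch; the function names are illustrative, and the quasinorm is computed as $\max_k k^{1/p} a_k^*$, which is the smallest $M$ in the definition above.

```python
def decreasing_rearrangement(a):
    """Magnitudes of the entries of a, sorted from largest to smallest."""
    return sorted((abs(x) for x in a), reverse=True)

def lp_norm(a, p: float) -> float:
    """(sum |a_j|^p)^(1/p); a quasinorm when 0 < p < 1."""
    return sum(abs(x) ** p for x in a) ** (1.0 / p)

def weak_lp_quasinorm(a, p: float) -> float:
    """Smallest M with a*_k <= M k^(-1/p), i.e. max over k of k^(1/p) a*_k."""
    return max(k ** (1.0 / p) * v
               for k, v in enumerate(decreasing_rearrangement(a), start=1))

if __name__ == "__main__":
    for length in (10, 1000, 100000):
        a = [1.0 / n for n in range(1, length + 1)]
        # The weak l_1 quasinorm stays bounded; the l_1 norm grows like log(length).
        print(length, weak_lp_quasinorm(a, 1.0), lp_norm(a, 1.0))
```

The printed $\ell_1$ norms grow without bound as the truncation length increases, while the weak $\ell_1$ quasinorms stay at (approximately) 1.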


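Let $\mathcal{D} = \mathcal{B}$ be an orthonormal basis for the Hilbert space $X = \mathcal{H}$. For $f \in X$ with
representation in $\mathcal{B} = [\phi_1, \phi_2, \ldots]$ as $f = \sum_n c_n(f) \phi_n$, we have $f \in \mathcal{A}^r(\mathcal{B})_X$ if
and only if the sequence $(c_n(f)) \in w\ell_p$, with $\frac{1}{p} = r + \frac{1}{2}$. Moreover, there exist
$C_0, C_0' \in \mathbb{R}$ such that $C_0 \| (c_n(f)) \|_{w\ell_p} \le \| f \|_{\mathcal{A}^r} \le C_0' \| (c_n(f)) \|_{w\ell_p}$.


Example:
For $r = \frac{1}{2}$: $f \in \mathcal{A}^{1/2}$ if and only if $(c_n(f)) \in w\ell_1$, i.e. if $c_n^*(f) \le M n^{-1} = \frac{M}{n}$.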


We prove the converse statement; the proof of the forward statement is left to the reader.
We would like to show that if $(c_n(f)) \in w\ell_p$, then $f \in \mathcal{A}^r$, with $r = \frac{1}{p} - \frac{1}{2}$. The
best $n$-term approximation of $f$ in $\mathcal{B}$ is of the form $s = \sum_{k \in \Lambda} a_k \phi_k$, $\#\Lambda \le n$.
Therefore, we have:

$$\begin{aligned}
\sigma_n(f)_X^2 = \inf_{s \in \Sigma_n} \| f - s \|_X^2
&= \inf_{\Lambda, (a_k)} \Big\| \sum_{k \in \Lambda} (c_k(f) - a_k) \phi_k + \sum_{k \notin \Lambda} c_k(f) \phi_k \Big\|_X^2 \\
(3) \quad &= \inf_{\Lambda, (a_k)} \Big[ \sum_{k \in \Lambda} |c_k(f) - a_k|^2 + \sum_{k \notin \Lambda} |c_k(f)|^2 \Big] = \sum_{k > n} |c_k^*(f)|^2 \\
&\le M^2 \sum_{k=n+1}^{\infty} k^{-2/p} = M^2 \sum_{k=n+1}^{\infty} k^{-2r-1} \quad (\text{since } (c_n(f)) \in w\ell_p),
\end{aligned}$$

where $M := \| (c_n(f)) \|_{w\ell_p}$. Comparing the sum with an integral gives

(4) $\sum_{k=n+1}^{\infty} k^{-2r-1} \le \int_n^{\infty} x^{-2r-1} \, dx = C n^{-2r}$,

where we define $C = \frac{1}{2r}$. Using this result in the earlier statement, we get

(5) $\sum_{k=n+1}^{\infty} |c_k^*(f)|^2 \le C M^2 n^{-2r}$;

this implies $\sigma_n(f)_X \le \sqrt{C} M n^{-r}$, so that by definition $f \in \mathcal{A}^r$.


Thresholding and Greedy Bases 


We shall next discuss some notions related to best n-term approximation. 


Thresholding 


1. Let $X$ be a Hilbert space with orthonormal basis $(\phi_j)$. Given $f$, let $\Lambda_\epsilon(f) = \{ j : |c_j(f)| > \epsilon \}$. The
thresholding operator $T_\epsilon$ is defined by
Equation:

$$T_\epsilon f := \sum_{j \in \Lambda_\epsilon(f)} c_j(f) \phi_j.$$

It is easy to see that for each $\epsilon$, $T_\epsilon f$ is the best approximation to $f$
using $N$ terms where $N$ is the cardinality of $\Lambda_\epsilon$:
Equation:

$$\| f - T_\epsilon f \|_X = \sigma_N(f)_X.$$

Thresholding is easily implemented on a computer.

2. The thresholding scheme above can be generalized if $X$ is not a Hilbert
space provided the dictionary has some specific structure. For
example, when

   1. The dictionary is the wavelet basis and $X = L_p$, $1 < p < \infty$.
   2. $X = \ell_p$ and the dictionary is the canonical basis $\delta_j = (0, \ldots, 0, 1, 0, \ldots)$.
   3. $X$ is a general Banach space and the dictionary $(\psi_j)$ is a greedy
   basis.
Greedy Bases


We briefly describe the notion of greedy basis.
Definition


Given $X$, we say $(\psi_j)$ is a greedy basis for $X$ if for each $\epsilon > 0$ and each $f \in X$,
Equation:

$$\| f - T_\epsilon f \|_X \le C(X) \, \sigma_N(f)_X,$$
where $N$ is the cardinality of $\Lambda_\epsilon$.
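Definition


A basis $(\psi_j)$ is said to be unconditional if
Equation:

$$\Big\| \sum_j \epsilon_j c_j \psi_j \Big\|_X \le C \Big\| \sum_j c_j \psi_j \Big\|_X \quad \text{for every choice of signs } \epsilon_j = \pm 1,$$

or equivalently
Equation:

$$\Big\| \sum_j c_j \psi_j \Big\|_X \le C \Big\| \sum_j d_j \psi_j \Big\|_X \quad \text{where } |c_j| \le |d_j|.$$

This is an older concept from functional analysis. In words, this definition
says that if the terms $c_j \psi_j$ are rearranged, the series $\sum_j c_j \psi_j$ will still
converge. This is not generally true for all bases.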

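Definition


A basis $(\psi_j)$ is said to be democratic if
Equation:

$$\Big\| \sum_{j \in \Lambda} \psi_j \Big\|_X \le C \Big\| \sum_{j \in \Lambda'} \psi_j \Big\|_X$$
where the cardinality of $\Lambda'$ equals the cardinality of $\Lambda$.
Remark
$(\psi_j)$ greedy $\iff$ $(\psi_j)$ is both unconditional and democratic.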


Some examples involving the last two definitions:


• The Fourier basis in $L_p$, $p \ne 2$, is not democratic; it is a basis for
$1 < p < \infty$, but it is unconditional only for $p = 2$.

• The wavelet basis has both of these properties, and is therefore
greedy.


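If $X = L_p$ has $(\psi_j)$ greedy, $\mathcal{B} = \{ \psi_j \}$, $f = \sum_{j=1}^{\infty} c_j(f) \psi_j$,
$c_j(f) = \langle f, \tilde{\psi}_j \rangle$ where $(\tilde{\psi}_j)$ is a dual basis, then


Equation:
$$f \in \mathcal{A}^r((\psi_j), L_p) \iff (c_j(f)) \in w\ell_\tau, \quad \frac{1}{\tau} = r + \frac{1}{p},$$
and
Equation:

$$\| f \|_{\mathcal{A}^r} \approx \| (c_j(f)) \|_{w\ell_\tau}.$$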


Let us now consider a specific setting that we shall be concerned with a lot
in this course. We shall examine some of the concepts we have introduced
in the finite dimensional space of all sequences (points) in $\mathbb{R}^N$. Recall that
we can put many different norms on this space including the $\ell_p$ norms and
the weak $\ell_p$ norms.

Remark


Let $x = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^N$. The best approximation to $x$
from $\Sigma_n$ in the $\ell_p$ norm is the vector in $\Sigma_n$ which shares the $n$
largest (in magnitude) entries of $x$. Its error of approximation satisfies


Equation:
$$\sigma_n(x)_{\ell_p} \le C n^{-r} \| x \|_{w\ell_\tau}, \quad \frac{1}{\tau} = r + \frac{1}{p}.$$
Remark
$\| x \|_{w\ell_\tau} \le \| x \|_{\ell_\tau}$.


Example


For $p = 2$ and $r = \frac{1}{2}$, $\sigma_n(x)_{\ell_2} \le C n^{-1/2} \| x \|_{\ell_1}$ and $\tau = 1$. In words, this
equation shows what kind of $\tau$ is needed for a given decay rate (or, given
some $\tau$, what kind of decay rate will be achieved) when approximating $x$.

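Example


Show that $\sigma_n(x)_{\ell_p} \le C n^{-r} \| x \|_{\ell_\tau}$ holds with $C = 1$.


Proof: Let $\Lambda_n := \{ i : |x_i| \text{ is among the } n \text{ largest} \}$. Then


Equation:
$$\begin{aligned}
\sigma_n(x)_{\ell_p}^p &= \sum_{i \notin \Lambda_n} |x_i|^p \\
&= \sum_{i \notin \Lambda_n} |x_i|^{p-\tau} |x_i|^{\tau} \\
&\le \left( \| x \|_{w\ell_\tau} n^{-1/\tau} \right)^{p-\tau} \sum_{i} |x_i|^{\tau} \\
&\le \| x \|_{\ell_\tau}^{p-\tau} \, n^{-(p-\tau)/\tau} \, \| x \|_{\ell_\tau}^{\tau},
\end{aligned}$$
and so, since $(p - \tau)/\tau = rp$,
Equation:

$$\sigma_n(x)_{\ell_p}^p \le n^{-rp} \| x \|_{\ell_\tau}^p, \quad \text{i.e.} \quad \sigma_n(x)_{\ell_p} \le n^{-r} \| x \|_{\ell_\tau}.$$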
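
The inequality just proved is easy to test numerically on a finite vector. The sketch below assumes the reading $\sigma_n(x)_{\ell_p} \le n^{-r} \| x \|_{\ell_\tau}$ with $\frac{1}{\tau} = r + \frac{1}{p}$; the function names and the test vector are illustrative. The best $n$-term approximation keeps the $n$ largest-magnitude entries, which is also what thresholding at a suitable level does.

```python
def lp_norm(x, p: float) -> float:
    """(sum |x_i|^p)^(1/p); a quasinorm when 0 < p < 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def best_n_term(x, n: int):
    """Keep the n largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:n])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]

def sigma_n(x, n: int, p: float) -> float:
    """Error of best n-term approximation of x in the l_p norm."""
    approx = best_n_term(x, n)
    return lp_norm([a - b for a, b in zip(x, approx)], p)

if __name__ == "__main__":
    x = [((-1) ** i) / (i + 1) for i in range(200)]
    p = 2.0
    for tau in (1.0, 0.5):
        r = 1.0 / tau - 1.0 / p
        for n in (1, 5, 25):
            lhs = sigma_n(x, n, p)
            rhs = n ** (-r) * lp_norm(x, tau)
            print(f"tau={tau} n={n}: sigma_n={lhs:.4f} <= bound={rhs:.4f}: {lhs <= rhs}")
```

Smaller $\tau$ gives a faster rate $n^{-r}$, at the price of a larger norm $\| x \|_{\ell_\tau}$ on the right-hand side.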
Remark


For $X = L_p$ and $(\psi_j)$ a wavelet basis, the statement that the wavelet coefficients of $f$ are
in $\ell_\tau$ is equivalent to $f$ being in a certain Besov class (roughly speaking, $f$ has $r$
derivatives and $f^{(r)} \in L_\tau$). We refer the reader to [link] for precise
formulations of results of this type.


Greedy Algorithms 


We now turn to the question of generating good $n$-term
approximations from a general dictionary. We shall assume that the
dictionary $\mathcal{D}$ is complete in the Hilbert space $\mathcal{H}$. This means that every
element in $\mathcal{H}$ can be approximated arbitrarily well by linear combinations
of the elements of $\mathcal{D}$. Since the dictionary is no longer an orthogonal basis
as was considered above, we need to revisit how to find good $n$-term
approximations. Because of redundancy within the dictionary, we cannot
simply pick the largest coefficients as we did with a basis. Greedy
algorithms are a method to generate good $n$-term approximations.


1. General Greedy Algorithm. Given $f$, we want to generate an $n$-term
approximation to $f$,
Equation:

$$f \approx s_n := \sum_{j=1}^{n} c_j g_j, \quad g_j \in \mathcal{D}.$$

The general steps are as follows:


   1. Initialize: (approximation) $s_0 = 0$, (residual) $r_0 = f$,
   approximation collection $\Lambda_0 = \emptyset$.

   2. Search $\mathcal{D}$ for some $g \in \mathcal{D}$, then add $g$ to the set $\Lambda$.

   3. Use $\{ g_1, g_2, \ldots, g_n \}$ to find the new approximation $s_n$.


At stage $n$, we have $s_n$, $r_n = f - s_n$, and $\Lambda_n = \{ g_1, g_2, \ldots, g_n \}$.
There are many types of greedy algorithms. We describe the three most
common in the case $X$ is a Hilbert space. However, there are analogues
of these for $L_p$.

2. Pure Greedy Algorithm (PGA). From $r_n$, choose
$g_{n+1} := \arg\max_{g \in \mathcal{D}} |\langle r_n(f), g \rangle|$ (the $g$ that gives the largest inner
product). Then set
Equation:

$$s_{n+1} := s_n + \langle r_n(f), g_{n+1} \rangle g_{n+1},$$

Equation:
$$r_{n+1} := f - s_n - \langle f - s_n, g_{n+1} \rangle g_{n+1} = f - s_{n+1}.$$

This method is similar to a steepest descent algorithm for decreasing the
error.

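3. Orthogonal Greedy Algorithm (OGA). From $r_n$, choose
$g_{n+1} := \arg\max_{g \in \mathcal{D}} |\langle r_n(f), g \rangle|$ as in the PGA. Then set

Equation:

$$V_{n+1} := \text{span}\{ g_1, g_2, \ldots, g_{n+1} \},$$
Equation:

$$s_{n+1} := P_{V_{n+1}} f = \sum_{j=1}^{n+1} \alpha_j g_j,$$

where $P_V$ denotes the orthogonal projection onto the space $V$. We can
find $s_{n+1} = P_{V_{n+1}} f$ by solving the linear system of equations
Equation:

$$\Big\langle \sum_{j=1}^{n+1} \alpha_j g_j, g_k \Big\rangle = \langle f, g_k \rangle, \quad k = 1, \ldots, n+1.$$

Then, $r_{n+1} = f - s_{n+1}$.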

4. Relaxed Greedy Algorithm (RGA). From $r_n$, choose $g_{n+1}$ in some
way (for example, by our earlier methods) and then define

Equation:

$$s_{n+1}(f) := \alpha s_n + \beta g_{n+1}.$$

Unlike PGA, here we do not make a full step in the chosen direction.
For example, one way to proceed is to define
Equation:

$$(\alpha, \beta) := \arg\min_{\alpha, \beta} \| f - (\alpha s_n + \beta g_{n+1}) \|.$$

This type of greedy algorithm is known to perform the best as compared
with the previous two.
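
The PGA and OGA steps above can be sketched in a finite-dimensional Hilbert space. The code below assumes $\mathcal{H} = \mathbb{R}^d$ with the usual inner product and a small hypothetical dictionary of unit vectors (`DICTIONARY`) and target (`F`), none of which come from the notes. For the OGA, instead of solving the Gram system explicitly, the sketch keeps an orthonormal basis of $V_n$ built by Gram-Schmidt, which yields the same projection $P_{V_n} f$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def axpy(alpha, u, v):
    """Return alpha * u + v."""
    return [alpha * a + b for a, b in zip(u, v)]

def pure_greedy(f, dictionary, n_steps):
    """PGA: s_{n+1} = s_n + <r_n, g> g, with g maximizing |<r_n, g>|."""
    s, r = [0.0] * len(f), list(f)
    for _ in range(n_steps):
        g = max(dictionary, key=lambda d: abs(dot(r, d)))
        c = dot(r, g)
        s, r = axpy(c, g, s), axpy(-c, g, r)
    return s, r

def orthogonal_greedy(f, dictionary, n_steps):
    """OGA: s_n is the orthogonal projection of f onto span{g_1, ..., g_n}."""
    basis, r = [], list(f)  # basis: orthonormal basis of V_n (Gram-Schmidt)
    for _ in range(n_steps):
        g = max(dictionary, key=lambda d: abs(dot(r, d)))
        v = list(g)
        for q in basis:
            v = axpy(-dot(v, q), q, v)
        nv = norm(v)
        if nv > 1e-12:  # g adds a new direction to V_n
            q = [a / nv for a in v]
            basis.append(q)
            r = axpy(-dot(r, q), q, r)
    s = [a - b for a, b in zip(f, r)]
    return s, r

# Hypothetical redundant dictionary of unit vectors in R^3 and a target vector.
S2, S3 = 1.0 / math.sqrt(2.0), 1.0 / math.sqrt(3.0)
DICTIONARY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
              [S2, S2, 0.0], [0.0, S2, S2], [S3, S3, S3]]
F = [1.0, 2.0, 0.5]

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, norm(pure_greedy(F, DICTIONARY, n)[1]),
              norm(orthogonal_greedy(F, DICTIONARY, n)[1]))
```

In $\mathbb{R}^3$ the OGA residual vanishes after at most three steps, because each selected element adds a new direction, whereas the PGA residual typically decreases without reaching zero in finitely many steps.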


Measuring Performance 


Given $X$ and $\mathcal{D}$, it is not practical to minimize $\sigma_n(f)_X$ by searching over all the
possibilities. The greedy approximation gives an $n$-term solution with less
computation, but does it perform well?


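Let
Equation:

$$\mathcal{L}_1(\mathcal{D}) := \Big\{ f \in X : f = \sum_{g \in \mathcal{D}} c_g g, \ \sum_{g \in \mathcal{D}} |c_g| \le M \Big\},$$

where the smallest such $M$ is the $\mathcal{L}_1$ norm of $f$.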


For OGA or RGA as described above, we have
Equation:

$$\| f - s_n(f) \| \le C n^{-1/2} | f |_{\mathcal{L}_1}.$$
Remark


This is similar to $\sigma_n(x)_{\ell_2} \le n^{-1/2} \| x \|_{\ell_1}$ ($n$-term approximation),
but it is not always quite as good.


Compressive Sensing 


We now consider a different setting. Suppose $x \in \mathbb{R}^N$ and we wish to
sample $x$, where taking a sample means the application of a linear
functional $\lambda \in \mathbb{R}^N$ to $x$. Next, we prescribe a budget of $n$ samples, and
consider all linear encoders using $n$ samples. We can write these $n$ linear
functionals as an $n \times N$ matrix $\Phi : \mathbb{R}^N \to \mathbb{R}^n$. We then consider a decoder
$\Delta : \mathbb{R}^n \to \mathbb{R}^N$. Our approximation to $x$ is thus $\Delta(\Phi(x))$.


To make the problem precise, we first pick a measure for distortion: 


$$\text{error} = \| x - \Delta(\Phi(x)) \|_{\ell_p}.$$


We next must make some assumption about xz. For example, we can assume 
that 


$$x \in \Sigma_k := \{ x : x_i = 0 \text{ for } i \notin \Lambda, \ \#\Lambda \le k \},$$
or 
$$x \in \ell_p,$$


or


$$x \in w\ell_p \quad (\text{weak } \ell_p).$$


We recall our basic problem: a signal $x \in \mathbb{R}^N$ will be “sampled” or
“sensed” by applying the $n$ linear functionals represented by the rows of
the sampling matrix $\Phi_{n \times N}$. The resulting measurements are given in the
vector $y \in \mathbb{R}^n$, where $y = \Phi x$. We will assume that $n < N$, meaning that
in addition to thinking of the sampling operation $\Phi$ as an encoder, we can
also view it as a projection to a lower dimensional linear subspace. In either
case, we would like the measurements $y$ to preserve as much information
about the signal $x$ as possible. To proceed in finding optimal solutions, we
must formalize this problem and define how we will measure this
information loss.


Moving forward, a critical quantity for us will be the null space of the
sampling matrix, $\mathcal{N} = \mathcal{N}(\Phi) = \{x : \Phi x = 0\}$. Because we are trying to
take as few measurements as possible, we will assume that we are taking
measurements efficiently so that the rows of $\Phi$ are all linearly independent.
Therefore, assume that $\operatorname{rank}(\Phi) = n$, implying that $\mathcal{N}$ has dimension
$N - n$. The non-trivial null space of $\Phi$ means that it is not an invertible
mapping. For any measurement vector $y$ we can define the class of all
observable signals that would result in the same measurement,
$\mathcal{F}(y) := \{x : \Phi x = y\} \subset \mathbb{R}^N$. The class $\mathcal{F}(y)$ can always be written as a
sum of a vector in the class and a vector in the null space of the sampling
matrix, $\mathcal{F}(y) = x_0 + \mathcal{N}$, where $x_0 \in \mathcal{F}(y)$. If we have two vectors
$x_0, x_1 \in \mathcal{F}(y)$, then by linearity we know that $\Phi(x_1 - x_0) = 0$. This fact
implies that $(x_1 - x_0) \in \mathcal{N}$, and consequently that $x_1 \in x_0 + \mathcal{N}$.
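
The coset structure $\mathcal{F}(y) = x_0 + \mathcal{N}$ can be seen on a tiny example. The sketch below uses a hypothetical $2 \times 3$ sampling matrix chosen here only for illustration; the helper name `apply` is likewise an assumption.

```python
def apply(phi, x):
    """Return the measurement vector y = Phi x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in phi]

# Toy sampling matrix with n = 2, N = 3; its null space is spanned by eta.
phi = [[1, 0, 1],
       [0, 1, 1]]
eta = [1, 1, -1]

x0 = [1, 2, 3]
x1 = [a + 2 * b for a, b in zip(x0, eta)]    # another point of x0 + null space

y0 = apply(phi, x0)
y1 = apply(phi, x1)
difference = [a - b for a, b in zip(x1, x0)]

print(y0, y1, apply(phi, difference))
```

Both signals produce the same measurements, and their difference is mapped to zero, so no decoder can tell them apart from $y$ alone.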


Associated to our encoder $\Phi$, we shall have to describe a decoder $\Delta$, which
is a (not necessarily linear) map $\Delta : \mathbb{R}^n \to \mathbb{R}^N$. This decoder will take the
measurements $y$ and try to recover $x$ as closely as possible, $\hat{x} = \Delta(y)$. In
order to design the best decoder $\Delta$, we must specify our metric for
measuring the quality of the estimate $\hat{x}$. For the moment we shall always
think of taking an optimal decoder for the problem at hand. Later we shall
discuss specific and concrete decoders.


To begin our discussion of the efficiency of the encoder $\Phi$ we consider the
following problem.


Note: We will fix $n$ and $N$ and try to find $\Phi$ and a decoder $\Delta$ such that for
all input signals in a sparsity class, $x \in \Sigma_k$, we can get perfect
reconstruction, $\Delta(\Phi x) = x$. We will be interested in determining the
largest value of $k$ for which we can find such an encoder/decoder pair.


An important role in this problem and later problems of compressed sensing
is played by certain submatrices of $\Phi$. Given a set $T \subset \{1, 2, \ldots, N\}$
representing a collection of column indices, we define the matrix $\Phi_T$ as the
one formed from $\Phi$ by using the columns from the set $T$. The matrix $\Phi_T$ is
an $(n \times \#(T))$ matrix. But sometimes we will also use the same notation
$\Phi_T$ to denote the matrix obtained from $\Phi$ by setting all entries not in the
columns of $T$ to zero.


We can now state the following theorem. 
The following statements are all equivalent for any given $\Phi$:


1. There exists a decoder $\Delta$ such that $\Delta(\Phi x) = x$ for all $x \in \Sigma_k$.

2. $\mathcal{N}(\Phi) \cap \Sigma_{2k} = \{0\}$.

3. $\Phi_T$ has rank $\#(T)$ for all $T$ with $\#(T) = 2k$.

4. $\Phi_T^T \Phi_T$ is non-singular (i.e., invertible) for all $T$ with $\#(T) = 2k$.


The equivalence of (2), (3), and (4) is simple linear algebra. First let us prove
that (1) implies (2). Assume (1). Suppose that there was a vector $\eta \in \mathcal{N} \cap \Sigma_{2k}$. We
know that we can write $\eta = x_0 - x_1$, where $x_0, x_1$ both have support of size at most
$k$: split the support of $\eta$ (of size at most $2k$) into two halves, let $x_0$ carry
the entries of $\eta$ on the first half, and let $-x_1$ carry those on the second half.
We know that $\eta \in \mathcal{N}$, which implies that $\Phi \eta = 0$. It then follows that
$\Phi x_0 = \Phi x_1$, and we have as a consequence that $\Delta \Phi x_0 = \Delta \Phi x_1$ and
finally that $x_0 = x_1$. This proves that $\eta = 0$ as desired.


Now let us prove that (2) implies (1). Suppose that we have a measurement
$\Phi x = y \in \mathbb{R}^n$, where $x \in \Sigma_k$. We will define the decoder $\Delta(y)$ to be the
signal $\hat{x} \in \mathcal{F}(y)$ with smallest support. Since $x$ has support of size at most $k$, so will $\hat{x}$.
We claim that there is no other $x' \in \mathcal{F}(y) \cap \Sigma_k$. Indeed, if $x'$ existed, then
$x - x' \in \mathcal{N} \cap \Sigma_{2k}$. But the only vector in $\mathcal{N} \cap \Sigma_{2k}$ is zero, implying that
$x = x'$. This finally gives us that $\Delta(\Phi x) = x$. $\square$


Using the previous theorem, we can turn to the question of finding good
encoder/decoder pairs. Given a fixed $N$, how large can $k$ be, and
what is the best $\Phi$? Given a fixed $n$, the largest $k$ is $k = \lfloor n/2 \rfloor$. Alternatively,
we can say that given a fixed $k$, we need at least $n = 2k$ measurements.
Another way to say this is that there exist encoding matrices $\Phi_{2k \times N}$ such
that any selection of $2k$ columns is linearly independent. Examples are
(the first $2k$ rows of) the DFT matrix or the Vandermonde matrix corresponding
to interpolation at distinct points $x_1, \ldots, x_N$.
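
The smallest-support decoder from the proof can be tried on a small Vandermonde example. The sketch below is a brute-force search over supports, so it is exponential in $N$ and is meant only to illustrate the theorem; the function names, the integer nodes, and the use of exact rational arithmetic are assumptions made here.

```python
from fractions import Fraction
from itertools import combinations

def vandermonde(n, nodes):
    """n x N matrix with entry (i, j) = nodes[j] ** i, stored as a list of columns."""
    return [[Fraction(t) ** i for i in range(n)] for t in nodes]

def encode(columns, x):
    """Measurements y = Phi x."""
    n = len(columns[0])
    return [sum(col[i] * xj for col, xj in zip(columns, x)) for i in range(n)]

def solve_exact(columns, y):
    """Solve sum_j c_j * columns[j] = y exactly; return None if the system
    is inconsistent or its solution is not unique."""
    n, s = len(y), len(columns)
    # Augmented matrix [Phi_T | y] with rational entries.
    rows = [[Fraction(columns[j][i]) for j in range(s)] + [Fraction(y[i])]
            for i in range(n)]
    pivot_row = 0
    for col in range(s):
        p = next((r for r in range(pivot_row, n) if rows[r][col] != 0), None)
        if p is None:
            return None                      # rank deficient
        rows[pivot_row], rows[p] = rows[p], rows[pivot_row]
        pv = rows[pivot_row][col]
        rows[pivot_row] = [v / pv for v in rows[pivot_row]]
        for r in range(n):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    if any(rows[r][s] != 0 for r in range(s, n)):
        return None                          # remaining equations are violated
    return [rows[j][s] for j in range(s)]

def decode_smallest_support(columns, y, k):
    """Return the signal in F(y) with the smallest support of size at most k."""
    N = len(columns)
    for s in range(k + 1):
        for T in combinations(range(N), s):
            c = solve_exact([columns[j] for j in T], y)
            if c is not None:
                x = [Fraction(0)] * N
                for j, cj in zip(T, c):
                    x[j] = cj
                return x
    return None

# k = 2, n = 2k = 4, N = 7, distinct nodes 1..7.
cols = vandermonde(4, [1, 2, 3, 4, 5, 6, 7])
x = [0, 0, 3, 0, 0, -2, 0]
print(decode_smallest_support(cols, encode(cols, x), 2) == x)
```

Because any $2k = 4$ columns of this Vandermonde matrix are linearly independent, the decoder recovers every 2-sparse signal from only 4 measurements.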

A question about whether we will ever really find natural signals in $\Sigma_k$
brings to mind a story... Once upon a time, a man was floating over the
countryside in a hot air balloon. The man in the balloon yelled down to a 
stranger on the ground and asked “Where am I?” The man on the ground 
thought for about 5 minutes and then answered “You’re in a hot air 
balloon.” The man in the balloon responded with “You must be a 
mathematician,” to which the man on the ground answered “Yes, how did 
you know?” “Because,” replied the man in the balloon, “you had to think a 
long time before you answered, your answer was very precise, and your 
answer was completely useless!” So, yes, we may be dealing with a limited 
model, but we have to crawl before we can walk. 


Gelfand n-widths 


Continuing from last time, we have a signal $x \in \mathbb{R}^N$, an $n \times N$
measurement matrix $\Phi$, and $y = \Phi x \in \mathbb{R}^n$ is the information we draw from
$x$.


We consider decoders $\Delta$ mapping $\mathbb{R}^n \to \mathbb{R}^N$. We have been discussing
whether there exists a decoder with certain properties. So for this discussion 
(about information preservation), we can just think about optimal decoding. 


While the previous result on sparse signal recovery is interesting, it is not
very robust. What happens if our signal does not have support $k$? Can we
still obtain meaningful results? The answer is yes. To do so we introduce
more general input signal classes $K \subset X$ that allow fully supported
signals. For example, we will consider the signal class defined by the unit
ball

Equation:


$$K = U(\ell_p^N) := \{ x : \| x \|_{\ell_p^N} \le 1 \}.$$


Given an encoder/decoder pair $(\Phi, \Delta)$, the worst case error on a set $K$ for
that pair will be given by
Equation:


$$E(K, \Phi, \Delta)_X := \sup_{x \in K} \| x - \Delta(\Phi x) \|_X.$$


Finally, using min-max principles we will define the minimum error over
all encoder/decoder pairs for a signal class and for a fixed number of
measurements $n$ to be

Equation:


$$E_n(K)_X := \inf_{\Phi : n \times N, \ \Delta} E(K, \Phi, \Delta)_X.$$


This measure $E_n(K)_X$ is the best we could do while measuring distortion
in the topology of $X$, using $n$ linear measurements, and using arbitrary
decoding.


We will see that these questions are actually equivalent to a classical
“n-width” problem. n-widths have seen a great deal of work over the years by
a variety of mathematicians: Kolmogorov, Tikhomirov, Kashin, Gluskin,
etc. There are many different flavors of n-widths, but we will study the
Gelfand n-width (the least intuitive of the widths).

Definition 

Gelfand n-width 


Let $K \subset X$ be compact. Given $n$, the Gelfand width (also called the dual
width) is given by
Equation:


$$d^n(K)_X := \inf_{Y : \operatorname{codim}(Y) = n} \ \sup \{ \| x \|_X : x \in K \cap Y \},$$


where by $\operatorname{codim}(Y) = n$ we mean that $Y$ has dimension
$\dim(X) - n$.


In other words, we are looking for the subspace Y that slices through the set 
K so that the norms of the projected signals are as small as possible. We 
can now state the following theorem about n-widths: 


Provided that $K$ has the properties (1) $K = -K$ and (2) $K + K \subset CK$,
then
Equation:


$$d^n(K)_X \le E_n(K)_X \le C d^n(K)_X$$


where $C$ is the same constant listed in property (2).
Clarifying notation: $CK = \{Cx : x \in K\}$ and
$K + K = \{x_1 + x_2 : x_1, x_2 \in K\}$.


We start with the left-hand inequality. We want to take any encoder/decoder
pair and use that to construct a $Y$. So let $\Phi, \Delta$ be an encoder/decoder pair. Then
simply let $Y = \mathcal{N}(\Phi)$. Now consider an $\eta \in K \cap Y$ and note that
$\Phi(\eta) = 0$ since $\eta \in Y$. Let $z = \Delta(0)$ be the decoding of 0 (practically
speaking, $z$ should be zero itself, but we avoid that assumption in this
proof). Then

Equation:


$$\begin{aligned}
\| \eta \|_X &\le \max \big( \| \eta - z \|_X, \ \| \eta + z \|_X \big) \\
&= \max \big( \| \eta - \Delta\Phi(\eta) \|_X, \ \| {-\eta} - \Delta\Phi(-\eta) \|_X \big) \\
&\le \sup_{x \in K} \| x - \Delta\Phi(x) \|_X
\end{aligned}$$


where we first employ the triangle inequality, then the fact that multiplying
by $-1$ does not change the norm, then the fact that $K = -K$. So then
Equation:


$$\sup_{\eta \in K \cap Y} \| \eta \|_X \le \sup_{x \in K} \| x - \Delta\Phi(x) \|_X.$$


Taking the infimum over all $\Phi, \Delta$, it follows that
Equation:


$$d^n(K)_X \le E_n(K)_X.$$


Since $\Phi$ is $n \times N$, then $\dim(\mathcal{N}(\Phi)) \ge N - n$.


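Now we prove the right-hand inequality. Assume we have a good $Y$.
Suppose $Y$ has codimension $n$, i.e., dimension $N - n$. Then $Y^\perp$ (the orthogonal complement
of $Y$ in $\mathbb{R}^N$) has dimension $n$. Let $v_1, v_2, \ldots, v_n \in \mathbb{R}^N$ be a basis for $Y^\perp$.
Let $\Phi$ be the $n \times N$ matrix obtained by stacking the rows $v_1, v_2, \ldots, v_n$.
Then $\mathcal{N}(\Phi) = Y$. Define $\Delta(y)$ = any element of $K \cap \mathcal{F}(y)$ if there is
one (otherwise let $\Delta(y)$ be anything in $\mathcal{F}(y)$). Now look at the
performance $\| x - \Delta\Phi(x) \|_X$ for some $x \in K$. Both $x$ and $\Delta\Phi(x) =: x'$
are elements of $K$, so $x - x'$ is in $\mathcal{N}(\Phi)$ and, by properties (1) and (2), in $CK$. Therefore
$(x - x')/C \in K \cap \mathcal{N}(\Phi)$. Thus,
Equation:


$$\Big\| \frac{x - x'}{C} \Big\|_X \le \sup_{z \in Y \cap K} \| z \|_X,$$


and so for any $x \in K$,
Equation:


$$\| x - \Delta\Phi(x) \|_X \le C \sup_{z \in Y \cap K} \| z \|_X.$$


Taking the infimum over all $Y$, we get that $E_n(K)_X \le C d^n(K)_X$.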


From the proof of this theorem, we see that there is a matching between the
matrices $\Phi$ and the spaces $Y$ (via the nullspace of $\Phi$).


An important result is that $d^n(U(\ell_q^N))_{\ell_p^N}$ is known for all $p, q$ except
$p = 1, q = \infty$. A precise statement of these widths can be found in the
book [link]. A particularly important case is

Equation:


$$C_1 \sqrt{\frac{\log(N/n)}{n}} \le d^n(U(\ell_1^N))_{\ell_2^N} \le C_2 \sqrt{\frac{\log(N/n)}{n}}$$


for N > 2n. This result was first proved by Kashin with a worse logarithm 
in the upper inequality and later brought to the present form by Gluskin. 
This result solves several important problems in functional analysis and 
approximation. 


Instance Optimality 


Now we consider another way (actually two related ways) to measure 
optimality of an encoder/decoder pair. 


1. Instance optimality. Suppose we are in $\mathbb{R}^N$ with an $n \times N$
measurement matrix $\Phi$ and a decoder $\Delta$. Recall that
Equation:


$$\sigma_k(x)_X := \inf_{z \in \Sigma_k} \| x - z \|_X.$$


We say that the encoding/decoding strategy $\Phi, \Delta$ is instance optimal of
order $k$ with constant $C_0$ if
Equation:


$$\| x - \Delta\Phi(x) \|_X \le C_0 \sigma_k(x)_X$$


for all $x \in \mathbb{R}^N$. (Note that we are no longer restricting $x$ to a class $K$.)
Better $\Phi$'s have larger $k$ for which this holds. The name “instance
optimal” indicates that the encoding/decoding performance depends on
each instance of $x$.

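2. Mixed-norm instance optimality (MNIO). Let $q \le p$. The
encoder/decoder pair $\Phi, \Delta$ is MNIO for $p$, $q$, $k$, and $C_0$ if
Equation:


$$\| x - \Delta\Phi(x) \|_{\ell_p^N} \le C_0 \frac{\sigma_k(x)_{\ell_q^N}}{k^{1/q - 1/p}}.$$


Cases of interest include asking whether
Equation:

$$\| x - \Delta\Phi(x) \|_{\ell_1^N} \le C_0 \sigma_k(x)_{\ell_1^N}$$
and whether


Equation:


$$\| x - \Delta\Phi(x) \|_{\ell_2^N} \le C_0 \frac{\sigma_k(x)_{\ell_1^N}}{\sqrt{k}}.$$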
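
The quantity $\sigma_k(x)_{\ell_p}$ that appears on the right-hand side of these definitions is easy to compute, since in $\ell_p$ a best $k$-term approximation keeps the $k$ largest-magnitude entries. A minimal Python sketch follows; the function names are chosen here for illustration.

```python
def best_k_term(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    z = [0.0] * len(x)
    for i in keep:
        z[i] = x[i]
    return z

def sigma_k(x, k, p=1.0):
    """l_p error of the best k-term approximation, sigma_k(x)_{l_p}."""
    z = best_k_term(x, k)
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1.0 / p)

x = [3.0, -1.0, 0.5, 4.0]
print(best_k_term(x, 2), sigma_k(x, 2), sigma_k(x, 2, p=2.0))
```

An instance-optimal pair must reproduce every $k$-sparse $x$ exactly, because `sigma_k(x, k)` is zero for such signals.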


Let’s focus on instance optimality. It would be interesting to know whether
a given $\Phi$ satisfies this property. To answer this question, we state an
equivalent condition to instance optimality.


Consider the statements 


1. $\Phi, \Delta$ is instance optimal of order $k$ on $X$.

2. $\Phi$ has the following nullspace property (NSP):
$\| \eta \|_X \le C_1 \| \eta_{T^c} \|_X$ for all $\eta \in \mathcal{N}(\Phi)$, $\#T \le k$.

3. $\| \eta \|_X \le C_1 \sigma_k(\eta)_X$ for all $\eta \in \mathcal{N}(\Phi)$.

4. $\| \eta_T \|_X \le C_1 \| \eta_{T^c} \|_X$ for all $\eta \in \mathcal{N}(\Phi)$, $\#T \le k$.


Then (2) and (3) are equivalent with the same constant; (4) is equivalent to
(2) and (3) but with a different constant. Also (1) with a value $k$ implies (2)
with the same $k$, and (2) with a value $2k$ implies (1) with a value $k$.


The Restricted Isometry Property 


We say that an $n \times N$ matrix $\Phi$ has the restricted isometry property (RIP) for $k$ if for
each $T \subset \{1, \ldots, N\}$ such that $\#T \le k$, $\Phi_T$ (the matrix formed by choosing the
columns of $\Phi$ whose indices are in $T$) has the property


$$(1 - \delta_k) \| x_T \|_{\ell_2} \le \| \Phi_T(x_T) \|_{\ell_2} \le (1 + \delta_k) \| x_T \|_{\ell_2} \qquad \text{(RIP)}$$


where $0 < \delta_k < 1$. This useful definition is by Candes and Tao. The idea is that the
embedding of a $k$-dimensional space in $n$-dimensional space almost preserves
norm, like an isometry. Another way of looking at it is to consider the matrix $\Phi_T^T \Phi_T$,
of size $k \times k$. This matrix is symmetric, positive definite, and its eigenvalues are
between $(1 - \delta_k)^2$ and $(1 + \delta_k)^2$.
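
For very small $k$ the restricted isometry bounds can be computed exactly. The sketch below treats only the case $k = 2$ with unit-norm columns, where the Gram matrix of any two columns is $\begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}$ with eigenvalues $1 \pm c$, so the extreme singular values over all pairs are $\sqrt{1 \pm \mu}$ with $\mu$ the largest absolute inner product between distinct columns. The function names, the matrix sizes, and the restriction to $k = 2$ are assumptions made here to keep the example dependency-free.

```python
import math
import random
from itertools import combinations

def gaussian_matrix_columns(n, N, seed=0):
    """Columns of an n x N Gaussian matrix, each normalized to unit l_2 norm."""
    rng = random.Random(seed)
    cols = []
    for _ in range(N):
        c = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = math.sqrt(sum(v * v for v in c))
        cols.append([v / s for v in c])
    return cols

def isometry_bounds_k2(cols):
    """Smallest and largest singular value over all two-column submatrices."""
    mu = max(abs(sum(a * b for a, b in zip(u, v)))
             for u, v in combinations(cols, 2))
    return math.sqrt(1.0 - mu), math.sqrt(1.0 + mu)

cols = gaussian_matrix_columns(20, 40, seed=1)
lo, hi = isometry_bounds_k2(cols)
delta_2 = max(1.0 - lo, hi - 1.0)   # constant in the (unsquared) RIP for k = 2
print(lo, hi, delta_2)
```

For larger $k$ the same check requires the extreme singular values of every $n \times k$ column submatrix, which quickly becomes infeasible.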


I prefer the following modified condition (dubbed the MRIP), which is more
convenient for mathematical analysis:


$$c_1 \| x_T \|_{\ell_2} \le \| \Phi_T(x_T) \|_{\ell_2} \le c_2 \| x_T \|_{\ell_2} \qquad \text{(MRIP)}$$


We can now state the following theorem.
If $\Phi$ satisfies the MRIP for $2k$, then there exists a decoder $\Delta$ such that $(\Phi, \Delta)$ is instance optimal in $\ell_1^N$ for $k$.


This shows that whenever we have a matrix $\Phi$ satisfying the MRIP for $2k$ then it will
perform well on encoding vectors (at least in the sense of $\ell_1^N$ accuracy). The question
is how can we construct measurement matrices with this property? We can construct
$\Phi$ using Gaussian entries and then normalizing the columns.


There exists a constant $c > 0$ such that if $k \le c \, n / \log(N/n)$, then with high probability $\Phi$ satisfies the RIP and
the MRIP for $k$.


Given $N$ and $n$, the range of $k$ in the above results reflects how accurately we can
recover data. There is another constant $c'$ that serves as a converse bound for
Theorem 3. This converse can be derived using Gluskin widths.


Remark 


The following generic problem is of great interest: Consider the class of $n \times N$ matrices
$\Phi$ that have some prescribed property (e.g., Toeplitz, circulant, etc.).
What is the largest $k$ for which such a matrix can have the MRIP?


The Nullspace Property 


We begin with a property of the null space $\mathcal{N}$ which is at the heart of proving
results on instance optimality.


We say that $\mathcal{N}$ has the Null Space Property if for all $\eta \in \mathcal{N}$ and all $T$ with
$\#T \le k$ we have $\| \eta \|_X \le c_1 \| \eta_{T^c} \|_X$.


Intuitively, NSP implies that for any vector in the nullspace the energy will 
not be concentrated in a small number of entries. 


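The following are equivalent formulations for the NSP in $X$ for $k$:


1. $\| \eta \|_X \le c_1 \sigma_k(\eta)_X$
2. $\| \eta_T \|_X \le c_1 \| \eta_{T^c} \|_X$, where $\eta = \eta_T + \eta_{T^c}$.


Note also that the triangle inequality can be used as follows


$$\| \eta \|_X = \| \eta_T + \eta_{T^c} \|_X \le \| \eta_T \|_X + \| \eta_{T^c} \|_X$$
which shows that (2) is equivalent to the NSP (possibly with a different constant).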


1. If $(\Phi, \Delta)$ is instance optimal on $X$ for the value $k$, then $\Phi$ satisfies the
NSP for $2k$ on $X$ with an equivalent constant.

2. If $\Phi$ has the NSP for $X$ and $2k$, then there exists a decoder $\Delta$ such that $(\Phi, \Delta)$ has the instance optimal
property for $k$.


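We will prove a slightly weaker version of this to save time. We first prove
that instance optimality for $k$ implies the NSP in $X$ for $k$ (hence this is slightly
weaker than advertised). Let $\eta \in \mathcal{N}$ and set $z = \Delta(0)$. Then

Equation:


$$\begin{aligned}
\| \eta - z \| &\le c_0 \sigma_k(\eta) && \text{(instance optimal property)} \\
\| \eta + z \| &\le c_0 \sigma_k(\eta) && (-\eta \in \mathcal{N}) \\
\| \eta \| &\le \max \{ \| \eta - z \|, \| \eta + z \| \} \le c_0 \sigma_k(\eta) && \text{(triangle inequality)}
\end{aligned}$$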


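We now prove 2. Suppose $\Phi$ has the NSP for $2k$ with constant $c_1$. Given $y$, let
$\mathcal{F}(y) = \{x : \Phi(x) = y\}$. Let us define the decoder $\Delta$ by


$$\Delta(y) := \operatorname{argmin} \{ \sigma_k(x)_X : x \in \mathcal{F}(y) \}.$$

Then, for any $x$, the vector $x' := \Delta(\Phi x)$ satisfies $\Phi x' = \Phi x$ and
$\sigma_k(x')_X \le \sigma_k(x)_X$. Hence $x - x' \in \mathcal{N}$, and the NSP for $2k$ gives
$\| x - x' \|_X \le c_1 \sigma_{2k}(x - x')_X \le c_1 (\sigma_k(x)_X + \sigma_k(x')_X) \le 2 c_1 \sigma_k(x)_X$,
which is instance optimality of order $k$.


Note that the instance optimal property automatically gives reproduction of
$k$-sparse signals.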


At this stage the challenge is to create $\Phi$ with this instance optimal property.
For this we shall use the restricted isometry property as introduced earlier and
which we now recall.


Optimality and the MRIP 


From our last lecture, we are interested in signals $x \in \mathbb{R}^N$, and we can
make $n$ measurements (or ask $n$ questions) to obtain $y = \Phi x \in \mathbb{R}^n$. We
proposed several optimality criteria to make these measurements, i.e., to
choose the measurement matrix $\Phi$.


1. For signals $x \in \Sigma_k$ (with $k$ non-zero coefficients), we choose an
encoder $\Phi$ and the corresponding decoder, $\Delta$, such that $\Delta(\Phi(x)) = x$.

2. For classes of signals, e.g., $K = U(\ell_p^N)$, our performance measure is
closely related to the Gelfand widths.

3. Instance Optimal: our encoder and decoder, $\Phi$ and $\Delta$, should satisfy


Equation:


$$\| x - \Delta\Phi(x) \|_{\ell_p} \le C_0 \sigma_k(x)_{\ell_p}.$$


We see that criterion (3) implies (1) since $\sigma_k(x)_{\ell_p} = 0$ for $x \in \Sigma_k$. We also
showed in the previous lecture that we have instance optimality for order $k$
if and only if we have the Null Space Property (NSP) of order $2k$. As a
reminder, NSP of order $2k$ means that for all $\eta \in \mathcal{N}$, $\mathcal{N} = \{x : \Phi(x) = 0\}$, we
have $\| \eta \|_{\ell_p} \le C_1 \sigma_{2k}(\eta)_{\ell_p}$. In other words, elements $\eta \in \mathcal{N}$ are not
sparse, and they are all of approximately equal size (i.e., they do not
concentrate their entries in $2k$ positions).


We mentioned that, in order to attain instance optimality, the $n \times N$
measurement matrix $\Phi$ should have the modified restricted isometry
property (MRIP) of order $m$, i.e., when we choose any $m$ columns of $\Phi$ to
obtain an $n \times m$ sub-matrix $\Phi_T$, where $m \le n$,

Equation:


$$C_1 \| x_T \|_{\ell_2} \le \| \Phi_T x_T \|_{\ell_2} \le C_2 \| x_T \|_{\ell_2}.$$


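If $\Phi$ has MRIP for $m = 3k$, then $\Phi$ has NSP for $\ell_1$ of order $2k$, and so $\Phi$ is
instance optimal in $\ell_1$ of order $k$, i.e.,


Equation:


$$\| x - \Delta\Phi(x) \|_{\ell_1} \le C \sigma_k(x)_{\ell_1}.$$


A related result is that under the same assumption,
Equation:


$$\| x - \Delta\Phi(x) \|_{\ell_2} \le C \frac{\sigma_k(x)_{\ell_1}}{\sqrt{k}}.$$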


To prove ([link]), we need to only show NSP for $\ell_1$ and $2k$, i.e.,
Equation:


$$\| \eta \|_{\ell_1} \le C \sigma_{2k}(\eta)_{\ell_1}, \qquad \eta \in \mathcal{N}.$$


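Given $\eta$, let $T_0$ be the set of indices of the $2k$ largest entries, $T_1$ be the set
of indices of the $k$ next largest entries, $T_2$ be the set of indices of the $k$ next
largest entries, and so on.

Equation:


$$\begin{aligned}
\eta_0 &= \eta_{T_0} + \eta_{T_1}, \\
\eta &= \eta_0 + \eta_{T_2} + \cdots + \eta_{T_s}, \\
\Phi(\eta) &= \Phi(\eta_0 + \eta_{T_2} + \cdots + \eta_{T_s}) = 0, \\
\Phi(\eta_0) &= -\Phi(\eta_{T_2} + \cdots + \eta_{T_s}), && \text{by linearity} \\
&= -\sum_{j=2}^{s} \Phi(\eta_{T_j}).
\end{aligned}$$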


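Therefore, we can estimate
Equation:


$$\begin{aligned}
\| \eta_0 \|_{\ell_2} &\le C_2 \| \Phi(\eta_0) \|_{\ell_2}, && \text{by restricted isometry} \\
&\le C_2 \sum_{j=2}^{s} \| \Phi(\eta_{T_j}) \|_{\ell_2}, && \text{by triangle inequality} \\
&\le C_2^2 \sum_{j=2}^{s} \| \eta_{T_j} \|_{\ell_2}, && \text{by restricted isometry.}
\end{aligned}$$

Here the lower MRIP bound is used in the form $\| x_T \|_{\ell_2} \le C_2 \| \Phi_T x_T \|_{\ell_2}$,
which we may assume after enlarging $C_2$ if necessary; it applies to $\eta_0$ because
$\eta_0$ has at most $3k$ nonzero entries.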


Since, for $j \ge 2$, $\eta_{T_{j-1}}$ is the best $k$-term approximation to $\eta_{T_{j-1}} + \eta_{T_j}$,
Equation:


$$\| \eta_{T_j} \|_{\ell_2} = \sigma_k(\eta_{T_{j-1}} + \eta_{T_j})_{\ell_2}.$$


Furthermore, we know for any $q < p$,
Equation:


$$\sigma_k(\eta)_{\ell_p} \le k^{\frac{1}{p} - \frac{1}{q}} \| \eta \|_{\ell_q}.$$


Combining ([link]) and ([link]), we obtain
Equation:


$$\| \eta_{T_j} \|_{\ell_2} \le \frac{\| \eta_{T_{j-1}} + \eta_{T_j} \|_{\ell_1}}{\sqrt{k}}.$$
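
The block inequality $\| \eta_{T_j} \|_{\ell_2} \le \| \eta_{T_{j-1}} + \eta_{T_j} \|_{\ell_1} / \sqrt{k}$ for $j \ge 2$ can be sanity-checked numerically. The Python sketch below builds the index sets $T_0, T_1, \ldots$ by sorting magnitudes; the helper names and the test vector are hypothetical, and the check is an illustration rather than a proof.

```python
import math

def blocks_by_size(eta, k):
    """Index sets T_0 (the 2k largest entries), then T_1, T_2, ... (k entries each)."""
    order = sorted(range(len(eta)), key=lambda i: abs(eta[i]), reverse=True)
    blocks = [order[:2 * k]]
    for start in range(2 * k, len(order), k):
        blocks.append(order[start:start + k])
    return blocks

def l1(eta, T):
    return sum(abs(eta[i]) for i in T)

def l2(eta, T):
    return math.sqrt(sum(eta[i] ** 2 for i in T))

def check_block_inequality(eta, k):
    """Check ||eta_{T_j}||_2 <= ||eta_{T_{j-1}} + eta_{T_j}||_1 / sqrt(k) for all j >= 2."""
    T = blocks_by_size(eta, k)
    # T_{j-1} and T_j are disjoint, so the l_1 norm of the sum is the sum of l_1 norms.
    return all(
        l2(eta, T[j]) <= (l1(eta, T[j - 1]) + l1(eta, T[j])) / math.sqrt(k) + 1e-12
        for j in range(2, len(T))
    )

eta = [5.0, -4.0, 3.0, 3.0, -2.0, 1.5, 1.0, -0.5, 0.25, 0.1]
print(blocks_by_size(eta, 2), check_block_inequality(eta, 2))
```

The inequality holds because every entry in $T_j$ is no larger in magnitude than the average magnitude over $T_{j-1}$.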


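Substituting ([link]) back into ([link]), we now have
Equation:


$$\begin{aligned}
\| \eta_0 \|_{\ell_2} &\le \frac{C_2^2}{\sqrt{k}} \sum_{j=2}^{s} \| \eta_{T_{j-1}} + \eta_{T_j} \|_{\ell_1} \\
&= \frac{C_2^2}{\sqrt{k}} \sum_{j=2}^{s} \big( \| \eta_{T_{j-1}} \|_{\ell_1} + \| \eta_{T_j} \|_{\ell_1} \big) \\
&\le \frac{2 C_2^2}{\sqrt{k}} \sum_{j=1}^{s} \| \eta_{T_j} \|_{\ell_1} \\
&= \frac{2 C_2^2}{\sqrt{k}} \sigma_{2k}(\eta)_{\ell_1}.
\end{aligned}$$


The last step is due to the fact that $\sum_{j=1}^{s} \| \eta_{T_j} \|_{\ell_1}$ is the $2k$-term
approximation error for $\eta$. Notice this is only true for $\ell_1$. This completes the
proof of ([link]).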


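To prove ([link]), we let $x_j$ be the $j$-th entry in $\eta_{T_0}$. By the Cauchy-Schwarz
Inequality (CSI),
Equation:
$$\begin{aligned}
\| \eta_{T_0} \|_{\ell_1} &= \sum_{j=1}^{2k} |x_j| \cdot 1 \\
&\le \Big( \sum_{j=1}^{2k} 1^2 \Big)^{1/2} \Big( \sum_{j=1}^{2k} |x_j|^2 \Big)^{1/2}, && \text{by CSI} \\
&= \sqrt{2k} \, \| \eta_{T_0} \|_{\ell_2} \\
&\le \sqrt{2k} \, \| \eta_0 \|_{\ell_2} \\
&\le 2\sqrt{2} \, C_2^2 \, \sigma_{2k}(\eta)_{\ell_1}.
\end{aligned}$$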


The $2k$-term approximation error in $\ell_1$ for $\eta$ can be expressed as
Equation:


$$\| \eta_{T_1} + \cdots + \eta_{T_s} \|_{\ell_1} = \sigma_{2k}(\eta)_{\ell_1}.$$


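Since
Equation:
$$\eta = \eta_{T_0} + (\eta_{T_1} + \cdots + \eta_{T_s}),$$


we can finally prove that
Equation:


$$\begin{aligned}
\| \eta \|_{\ell_1} &\le \| \eta_{T_0} \|_{\ell_1} + \| \eta_{T_1} + \cdots + \eta_{T_s} \|_{\ell_1}, && \text{by triangle inequality} \\
&\le 2\sqrt{2} \, C_2^2 \, \sigma_{2k}(\eta)_{\ell_1} + \sigma_{2k}(\eta)_{\ell_1} \\
&= (2\sqrt{2} \, C_2^2 + 1) \, \sigma_{2k}(\eta)_{\ell_1}.
\end{aligned}$$


Therefore, we have proved ([link]) with $C = 2\sqrt{2} \, C_2^2 + 1$. $\square$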


Summary 


Review 


Last time we proved that for each $k \le c_0 \, n / \log(N/n)$, there exists an $n \times N$
matrix $\Phi$ and a decoder $\Delta$ such that


• (a) $\| x - \Delta\Phi(x) \|_{\ell_1} \le c_0 \sigma_k(x)_{\ell_1}$
• (b) $\| x - \Delta\Phi(x) \|_{\ell_2} \le c_0 \sigma_k(x)_{\ell_1} / \sqrt{k}$

Recall that we can find such a $\Phi$ by setting the entries $[\Phi]_{j,k} = \phi_{j,k}(\omega)$ to
be realizations of independent and identically distributed Gaussian random
variables.


Deficiencies 


Decoding is not implementable 


Our decoding “algorithm” is: 
Equation: 


$$\Delta(y) := \operatorname{argmin}_{x \in \mathcal{F}(y)} \sigma_k(x)_{\ell_1},$$


where $\mathcal{F}(y) := \{x : \Phi(x) = y\}$. In general, this algorithm is not
implementable. This deficiency, however, is easily repaired. Specifically,
define

Equation: 


$$\Delta_1(y) := \operatorname{argmin}_{x \in \mathcal{F}(y)} \| x \|_{\ell_1}.$$


Then (a) and (b) hold for $\Delta_1$ in place of $\Delta$. This decoding algorithm is
equivalent to solving a linear programming problem, thus it is tractable and
can be solved using techniques such as the interior point method or the
simplex method. In general, these algorithms have computational
complexity $O(N^3)$. For very large signals this can become prohibitive, and
hence there has been a considerable amount of research in faster decoders
(such as decoding using greedy algorithms).
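
To make the equivalence with linear programming concrete, one standard reformulation of $\min \{ \|x\|_{\ell_1} : \Phi x = y \}$ is sketched below. The auxiliary variables $t_1, \ldots, t_N$ are introduced here for illustration and do not appear elsewhere in these notes.

```latex
% Auxiliary variables t_1, ..., t_N bound the magnitudes |x_i|.
\begin{align*}
\text{minimize}   \quad & \sum_{i=1}^{N} t_i \\
\text{subject to} \quad & \Phi x = y, \\
                        & -t_i \le x_i \le t_i, \qquad i = 1, \ldots, N.
\end{align*}
```

The objective and all constraints are linear in $(x, t)$, and at any optimum $t_i = |x_i|$, so the optimal value equals the minimal $\ell_1$ norm over $\mathcal{F}(y)$.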


We cannot generate such ® 


The construction of $\Phi$ from realizations of Gaussian random variables is
guaranteed to work with high probability. However, we would like to know,
given a particular instance of $\Phi$, whether (a) and (b) still hold. Unfortunately, this
is impractical to check (since, to show that $\Phi$ satisfies the MRIP for $k$, we
need to consider all possible submatrices of $\Phi$). Furthermore, we would like
to build $\Phi$ that can be implemented in circuits. We also might want fast
decoders $\Delta$ for these $\Phi$. Thus we also may need to be more restrictive in
building $\Phi$. Two possible approaches that move in this direction are as
follows:


1. Find $\Phi$ that we can build such that we can prove instance optimality in
$\ell_1$ for a smaller range of $k$, i.e.,
Equation:


$$\| x - \Delta\Phi(x) \|_{\ell_1} \le c_0 \sigma_k(x)_{\ell_1}$$


for $k < K$. If we are willing to sacrifice and let $K$ be smaller than
before, for example, $K \approx \sqrt{n}$, then we might be able to prove that
$\Phi_T^T \Phi_T$ is diagonally dominant for all $T$ such that $\#T = 2k$, which
would ensure that $\Phi$ satisfies the MRIP.

2. Consider $\Phi(\omega)$ where $\omega$ is a random seed that generates a $\Phi$. It is
possible to show that given $x$, with high probability, $\Phi(\omega)(x) = y$
encodes $x$ in an $\ell_2$-instance optimal fashion:

Equation:


$$\| x - \hat{x} \|_{\ell_2} \le 2 \sigma_k(x)_{\ell_2}$$


for $k \le c_0 \, n / \log(N/n)$, where $\hat{x}$ denotes the decoded signal. Thus, by generating many such matrices we can
recover any $x$ with high probability.


Encoding signals 


Another practical problem is that of encoding the measurements $y$. In a real
system these measurements must be quantized. This problem was addressed
by Candes, Romberg, and Tao in their paper Stable Signal Recovery from
Incomplete and Inaccurate Measurements. They prove that if $y$ is quantized
to $\bar{y}$, and if $x \in U(\ell_p)$ for $p \le 1$, then we get optimal performance in terms of
the number of bits required for a given accuracy. Notice that their result
applies only to the case where $p \le 1$. One might expect that this argument
could be extended to $p$ between 1 and 2, but a warning is in order at this
stage:


Fix $1 \le p \le 2$. Then there exist $\Phi$ and $\Delta$ satisfying
Equation:


$$\| x - \Delta\Phi(x) \|_{\ell_p} \le C_0 \sigma_k(x)_{\ell_p}$$


if
Equation:


$$k \le c_0 N^{\frac{2 - 2/p}{1 - 2/p}} \Big( \frac{n}{\log(N/n)} \Big)^{\frac{p}{2 - p}}.$$
Furthermore, this range of $k$ is the best possible (save for the log term).
Examples:


• $p = 1$, we get our original results
• $p = 2$, we do not get instance optimality for $k = 1$ unless $n \approx N$
• $p = \frac{3}{2}$, we only get instance optimality if $k \le c_0 N^{-2} \big( \frac{n}{\log(N/n)} \big)^{3}$


